An Early Experience on Job Checkpoint/Restart - Working with SGI Irix OS and the Portable Batch System (PBS)
Sherry Chang

An Early Experience on Job Checkpoint/Restart - Working with SGI Irix OS and the Portable Batch System (PBS)
Sherry Chang, schang@nas.nasa.gov
Scientific Consultant
NASA Advanced Supercomputing Division
NASA Ames Research Center, Moffett Field, CA

Helping Our Users
Scientific Consultants at NAS help users run their jobs successfully with little or no impact on other users.
This presentation originated from helping with a user's case: the user plans to run a job that will need ~40 days to complete on the O2K.
- Checkpoint/restart within the source code is limited
- The job does not need many processors
- The job does not warrant dedicated time on the 64-, 256-, or 512-CPU systems
- Our batch queues on the O2K allow a maximum of 8 hours of walltime

Outline
- Why, how, and who?
- Examples: a Gaussian job and an MPI job
- Introduction to SGI's cpr
- Four methods for checkpoint/restart:
  - using cpr interactively
  - using cpr within a PBS script
  - using qhold and qrls of PBS
  - using qsub -c of PBS for periodic automatic checkpointing
- Successes and failures
- Future testing and wish list

Why Do Checkpointing?
- Halt and restart resource-intensive codes that take a long time to run
- Prevent job loss if the system fails
- Improve a system's load balancing and scheduling
- Allow hardware replacement and maintenance

How to Do Checkpoint/Restart?
- The user code has its own checkpoint capability
  Examples: many CFD codes, certain Gaussian jobs
- The OS has a built-in checkpoint/restart utility
  Examples: the Cray Unicos OS (chkpnt and restart); the SGI Irix OS (cpr, implemented in the 6.x releases)
- The batch system supports checkpoint/restart
  Examples: NQE, LSF, PBS

Who Can Checkpoint and Restart?
- The owner of the process(es)
- The superuser

Sample Gaussian Job
Gaussian input script o2.com (slide figure: two O atoms separated by a bond of length r):
  %nproc=2
  %chk=o2.chk
  #p CCSD/6-31g* OPT

  O2 Geometry Optimization

  0 1
  O
  O 1 r

  r 1.500
To run the Gaussian job:
  % g98 o2.com o2.out &

Sample MPI Job
Program pi.f: calculates the value of π
  % mpirun -np 3 ./pi > pi.out &
Use the -miser or -cpr option of mpirun to allow checkpoint/restart:
  % mpirun -miser -np 3 ./pi > pi.out &

SGI's CPR Commands
The cpr command provides a command-line interface for checkpoint/restart.
Checkpoint (the HID type checkpoints the process hierarchy (tree) rooted at the given PID):
  % cpr -c statefile -p pid:HID -k
Show information about an existing checkpoint statefile:
  % cpr -i statefile
Restart:
  % cpr -r statefile
Delete a checkpoint statefile:
  % cpr -D statefile
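
Tying these commands together, a minimal interactive cycle might look like the sketch below (the program name my_app, the PID, and the statefile name are placeholders, not from the talk):
  % ./my_app &                      # start the long-running program in the background
  [1] 12345                         # the shell reports its PID
  % cpr -c chk1 -p 12345:HID -k     # checkpoint the process tree rooted at that PID (-k stops it)
  % cpr -i chk1                     # inspect the checkpoint statefile
  % cpr -r chk1 &                   # restart the process tree from the statefile
  % cpr -D chk1                     # delete the statefile once it is no longer needed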

Using CPR Interactively
Start a Gaussian job:
  % g98 o2.com o2.out &
  [1] 19432                          (parent process ID)
  % ps
  PID   TTY   TIME  CMD
  19431 ttyq2 0:01  l101.exe         (child process)
  19432 ttyq2 0:00  g98              (parent process)
  19435 ttyq2 0:02  l101.exe         (child process)
Do the first checkpoint:
  % cpr -c chk1 -p 19432:HID -k
  Checkpointing id 19432 (type HID) to directory chk1
  Checkpoint done
The checkpoint is created in the working directory and is -rwx by root only.
Caveat: multiple-processor Gaussian jobs do not automatically clear their shared memory segments when the job is checkpointed (see the cleanup sketch below).
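
One possible way to clean up such leftover shared memory segments by hand is with the standard System V IPC tools; this is a sketch, not from the talk, and <shmid> is a placeholder:
  % ipcs -m             # list shared memory segments and their owners
  % ipcrm -m <shmid>    # remove a stale segment by its ID (as the owner or the superuser)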

Using CPR Interactively - continued
Restart:
  % cpr -r chk1 &
  [1] 19458
  % Restarting processes from directory chk1
  Process restarted successfully.
  [1] Done    cpr -r chk1
  % ps
  PID   TTY   TIME  CMD
  19431 ttyq2 0:27  l114.exe
  19432 ttyq2 0:00  g98
A child process of g98 is restarted.

Failure - 1: checkpoint stalled when using cpr interactively for MPI jobs
  % mpirun -miser -np 3 ./pi > pi.out &
  [1] 3527372
  % cpr -c chk1 -p 3527372:HID -k &
  [2] 3539292
  (no progress at all)
Production systems: the checkpoint stalled most of the time
Test-bed systems: successful

Using CPR within PBS
PBS script for the first checkpoint:
  #PBS -l ncpus=2
  #PBS -l mem=50mw
  #PBS -l walltime=1:00:00
  setenv g98root /usr/local/pkg
  source $g98root/g98/bsd/g98.login
  cd $PBS_O_WORKDIR
  g98 o2.com o2.out &
  sleep 20
  cpr -c chk1 -p `ps -u schang | grep g98 | awk '{print $1}'`:HID -k
(the ps/grep/awk pipeline finds the PID of the parent g98 process)
PBS script for a subsequent checkpoint/restart cycle:
  #PBS -l ncpus=2
  #PBS -l mem=50mw
  #PBS -l walltime=1:00:00
  setenv g98root /usr/local/pkg
  source $g98root/g98/bsd/g98.login
  cd $PBS_O_WORKDIR
  cpr -r chk1 &
  sleep 60
  cpr -c chk2 -p 3971074:HID -k
Caveat: the restart will fail if the PBS stdout/stderr files are not present.
Alternative: start the job and do the first checkpoint interactively.
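
A generalized subsequent-cycle script could rediscover the parent PID with the same ps/grep/awk pipeline instead of hard-coding it. The following csh sketch is not from the talk; the CHKOLD/CHKNEW variable names are invented and would be passed in with qsub -v (e.g. qsub -v CHKOLD=chk1,CHKNEW=chk2 cycle.pbs):
  #PBS -l ncpus=2
  #PBS -l mem=50mw
  #PBS -l walltime=1:00:00
  setenv g98root /usr/local/pkg
  source $g98root/g98/bsd/g98.login
  cd $PBS_O_WORKDIR
  cpr -r $CHKOLD &                                    # restart from the previous statefile
  sleep 60                                            # let the job run for a while
  # assumes only one g98 process is running for this user
  set pid = `ps -u schang | grep g98 | awk '{print $1}'`
  cpr -c $CHKNEW -p ${pid}:HID -k                     # checkpoint again and stop the job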

Failure - 2: restart failed using cpr in a PBS script
- Occurred for both MPI and Gaussian jobs
- Checkpoint/restart succeeded for a few cycles, then a restart failed in a later cycle:
  cpr -c chk1 -p xxxx:HID -k &
  cpr -r chk1                      (successful)
  cpr -c chk2 -p xxxx:HID -k &
  cpr -r chk2                      (successful)
  ...
  cpr -c chkN -p xxxx:HID -k &
  cpr -r chkN                      (failed)
Error messages:
  CPR Error: Failed to place mld 0 (Invalid argument)
  CPR Error: Unexpected status EOF
  CPR Error: Cleaning up the failed restart

Failure - 3: restart failed from a checkpoint statefile that had once been restarted successfully
- Occurred for both MPI and Gaussian jobs
  cpr -c chk1 -p xxxx:HID -k &
  cpr -r chk1                      (successful)
  cpr -c chk2 -p xxxx:HID -k &
  cpr -r chk2                      (successful)
  ...
  cpr -c chkN -p xxxx:HID -k &
  cpr -r chkN                      (failed)
  cpr -r chk2 &                    (failed, even though chk2 had restarted successfully before)
Error messages: same as the previous page

Using qhold and qrls of PBS
Submit the job:
  % qsub o2.script
  1121.evelyn.nas.nasa.gov
Hold the job (qhold checkpoints it and creates a checkpoint directory under /PBS/checkpoint):
  % qhold 1121
  % ls -l /PBS/checkpoint
  drwxr-xr-x 3 root root 28 May 1 12:20 1121.evelyn.CK
  % qstat -a
  Job ID         S
  1121.evelyn    H          (job held)
Release the job (it restarts from the checkpoint):
  % qrls 1121
  % qstat -a
  Job ID         S
  1121.evelyn    R          (job restarted)
Subsequent qhold/qrls cycles update the checkpoint directory.
PBS script o2.script:
  #PBS -l ncpus=2
  #PBS -l mem=50mw
  #PBS -l walltime=1:00:00
  setenv g98root /usr/local/pkg
  source $g98root/g98/bsd/g98.login
  cd $PBS_O_WORKDIR
  g98 o2.com o2.out
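
Because qhold and qrls drive the checkpoint, the hold/release cycle could in principle be scripted (though Failures 4 and 5 below show the cycle was not yet reliable). A minimal csh sketch, not from the talk; the one-hour interval is an arbitrary assumption:
  #!/bin/csh
  # Submit the job, then periodically checkpoint it by holding and releasing it.
  set jobid = `qsub o2.script`
  while (1)
      sleep 3600                    # wait an hour between checkpoints
      qstat $jobid >& /dev/null
      if ($status != 0) break       # stop once the job has left the queue
      qhold $jobid                  # checkpoint the job and hold it
      qrls $jobid                   # release it; PBS restarts it from the checkpoint
  end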

Failure - 4 (hopper)
  Turing % qsub mpi.pbs
  8128.fermi.nas.nasa.gov
  Turing % qhold 8128          (job held, processes stopped, 8128.fermi..CK created)
  Turing % qrls 8128           (job restarted successfully, processes running)
  Turing % qhold 8128          (job held again, 8128.fermi..CK updated)
  Turing % qrls 8128           (restart failed)
  Turing % qstat -a
  Job ID       S
  8128.fermi   R               (qstat says the job is running, but ps or top shows no processes running)
  Turing % qdel 8128
  Job ID       S
  8128.fermi   R               (the job cannot be deleted by qdel)
PBS needs serious clean-up.

Failure - 5 (t3)
  t3 % qsub mpi.pbs
  148.t3.nas.nasa.gov
  t3 % qhold 148               (job held, processes stopped, 148.t3.nas..CK created)
  t3 % qrls 148                (job restarted successfully, processes running)
  The restarted job ran for ~40 seconds and then got killed (slide timeline markers: 33 sec, 72 sec).
  t3 % qstat -a
  Job ID       S
  148.t3       R               (qstat says the job is running, but ps or top shows no processes running)
  t3 % qdel 148
  Job ID       S
  148.t3       R               (the job cannot be deleted by qdel)
PBS needs serious clean-up.

Using qsub -c of PBS for Periodic Automatic Checkpointing
Submit with the checkpoint option either in the script or on the qsub command line:
  % qsub o2.script          (or)          % qsub -c c=3 o2.script
  1120.evelyn.nas.nasa.gov
  % ls -l /PBS/checkpoint
  drwxr-xr-x 3 root root 28 May 1 12:05 1120.evelyn.CK
The checkpoint directory is updated every 3 minutes.
PBS script o2.script:
  #PBS -l ncpus=2
  #PBS -l mem=50mw
  #PBS -l walltime=1:00:00
  #PBS -c c=3
  setenv g98root /usr/local/pkg
  source $g98root/g98/bsd/g98.login
  cd $PBS_O_WORKDIR
  g98 o2.com o2.out
If the PBS mom or the system crashes, PBS should automatically restart a job that has a checkpoint directory associated with it after the system comes back.

Future Testing and Wish List
Future testing:
- A wide variety of user applications: OpenMP, PVM, MPI
- Large parallel jobs
- System-wide checkpoint/restart
- System crash simulation
- Efficiency
Ultimate goal: make sure checkpoint/restart is reliable in a real production environment.

IRIX/CPR - A Popular Topic
Recent email exchanges on this topic on sgi-tech@cug.org:
- Barry Sharp - Boeing
- Paul White - CSC
- Miroslaw Kupczyk - Poznan Supercomputing and Networking Center
- Torgny Faxen - National Supercomputing Center, Sweden
Topics discussed: the Irix OS; NQE and LSF; MPI, OpenMP, and Gaussian (no PVM yet); Irix vs. Unicos.
The SGI Supportfolio bug reports provide limited information; we would like to see more exchange on the details of the success and failure cases.

Acknowledgements
- Ed Hook - SciCon, PBS expert
- Lorraine Freeman - sysadmin
- Bron Nelson - SGI on-site analyst
- Chuck Niggley - SciCon Group Lead
NASA Advanced Supercomputing Division
CUG