Troubleshooting Jobs on Odyssey

Troubleshooting Jobs on Odyssey
Paul Edmon, PhD, ITC Research Computing Associate
Bob Freeman, PhD, Research & Education Facilitator, XSEDE Campus Champion

Goals
- Tackle PEND, FAIL, and slow-performance issues
- Highlight different approaches to troubleshooting
- Arm you with useful SLURM and Unix commands
- Enable you to work smarter, better, faster

PEND Problems
Why are you pending?
squeue -u USERNAME -t PD
squeue -u USERNAME -t PD -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R %C"
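
Purely for illustration (the job ID, user, and partition below are made up), a pending job printed with the format string above would look like this; the NODELIST(REASON) column is where SLURM explains the wait:
JOBID     PARTITION  NAME    USER    ST  TIME  NODES  NODELIST(REASON)  CPUS
12345678  general    my_job  jsmith  PD  0:00      1  (Priority)           8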

PEND Problems
Why are you pending? Most common reasons are...
None: Scheduler hasn't gotten around to you yet. Hang tight...
Resources: The job is waiting for resources to become available.
Priority: One or more higher-priority jobs exist for this partition or reservation. Priority is influenced by age, job size, partition, QoS, and Fairshare.
Dependency: This job is waiting for a dependent job to complete (--dependency; see the example below).
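
As a sketch of the Dependency case (the job ID and script name are hypothetical), a follow-up job can be held until an earlier job finishes successfully:
sbatch --dependency=afterok:12345678 postprocess.sbatch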

PEND Problems: Resources
What did you request? What are the parameters of your SLURM submission script?
scontrol show jobid -dd JOBID
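
For a quick look at just the requested resources, the scontrol output can be filtered (the job ID is hypothetical; the field names are standard scontrol output):
scontrol show jobid -dd 12345678 | grep -E 'Partition|NumNodes|NumCPUs|TimeLimit'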

PEND Problems: Priority
What is your priority in the scheduling queue?
showq-slurm -U
showq-slurm -p PARTITION -o

PEND Problems: Fair-share
What is a fair-share score? A score assigned to each lab that affects the scheduling priority of its jobs. A user's current and past usage is considered when determining the scheduling of job execution.
How is it calculated? The score is based on usage and shares:
- Usage: 0 ≤ Usage ≤ 1, represents your proportional use of Odyssey
- Shares: roughly slices of a pie, the part of Odyssey that is yours
- Premise: when Usage == Shares, you've hit your fairshare target
- Fairshare Factor = 2^(-Usage/Shares), so 0 < FF < 1
- When usage increases, FF decreases; when usage decreases, FF increases
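
A worked example under this formula (the numbers are purely illustrative): a lab entitled to 10% of the machine (Shares = 0.10) that has recently used 20% of it (Usage = 0.20) gets FF = 2^(-0.20/0.10) = 2^(-2) = 0.25, pushing its pending jobs down the queue, while a lab that has used only half of its share gets FF = 2^(-0.5) ≈ 0.71.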

PEND Problems: Fair-share
What is my score?
sshare -u USERNAME
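
If you want just the relevant columns, sshare accepts a field list via --format (these are standard sshare field names):
sshare -u USERNAME --format=Account,User,RawShares,NormShares,EffectvUsage,FairShare
The FairShare column is the 0-to-1 factor described on the previous slide.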

FAIL Problems
Determining the root cause is highly dependent on your ability to trace & document the problem.
Do you have log/error files?
#SBATCH -o job.stdout.txt
#SBATCH -e job.stderr.txt
Did you use the -e parameter to get an error file? If omitted, STDERR is redirected to the -o file or to slurm-JOBID.out.
Is the path non-existent? The path to place the output files must exist! Otherwise, SLURM won't know what to do and will do nothing but FAIL.
What else is going on in your SLURM submission script? (A minimal sketch follows.)
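
A minimal submission-script sketch tying these points together (the job name, partition, resources, and executable are hypothetical placeholders, not a recommendation for your job):
#!/bin/bash
#SBATCH -J myjob              # job name (hypothetical)
#SBATCH -p general            # partition (hypothetical; use yours)
#SBATCH -n 1                  # number of cores
#SBATCH --mem=4G              # memory
#SBATCH -t 0-01:00            # time limit, D-HH:MM
#SBATCH -o logs/myjob_%j.out  # STDOUT; the logs/ directory must already exist
#SBATCH -e logs/myjob_%j.err  # STDERR; omit -e and errors go to the -o file
./my_program                  # hypothetical executable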

FAIL Problems
Look at the fail trail in any log and/or error files. Nota bene! The last error listed may not be the root cause!

FAIL Problems
Segmentation faults? Shared/static library errors?
- Are the correct software packages and versions loaded?
- Are there version conflicts?
- Are too many things loaded in your .bashrc?
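
Two quick checks for this class of failure (the executable name is hypothetical):
module list                          # what is loaded right now?
ldd ./my_program | grep 'not found'  # which shared libraries fail to resolve?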

Poor performance
Much harder to diagnose!
- Transient issues: other jobs on the node, the network, storage, a bad node
- Software install issue
- Code compile/optimization problem

Poor Performance
Slow run times? Most common reasons are...
SLURM submission errors: Check --mem / --mem-per-cpu. Check cores (-n) + nodes (-N). Try running the job interactively (see the sketch below).
Overloaded storage: Is the mount point broken? Is the storage being hammered? Don't run out of home/lab directories. Is one file being accessed by hundreds of jobs? Make copies. Too many files in one directory? Check status.rc.
Overloaded node: Too much input/output for the given network interface, or overuse of CPUs relative to the SLURM request. Check status.rc; use squeue and top.
Sick node: Memory errors, /scratch filled up. Use lsload to look at node diagnostics.
Sick code: Monitor/trace program execution with strace. Look at code performance with perf.
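
To rule out submission errors, it often helps to reproduce the run interactively, as suggested in the first row above; a sketch (the partition name and resource numbers are hypothetical):
srun -p general --pty -n 4 --mem=8G -t 0-01:00 /bin/bash
Once the shell opens on a compute node, launch the program by hand and watch it with top.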

Poor performance: Nagios Compute
Check the state of the systems at https://status.rc.fas.harvard.edu
[Dashboard screenshots: Heavy load, but normal vs. Heavy load, oversubscribed]

Poor performance: Nagios Storage
Includes storage like /n/regal and /n/holyscratch

Poor performance: Node health
Check the health of the compute node (abused by other jobs?)
- Figure out which compute node your job is on: squeue -j JOBID
- Quick check using https://status.rc.fas.harvard.edu
- Look at load using lsload
- Look at jobs on the node using squeue
- Log in to the node and profile: top, strace, perf trace

Poor performance: Node health
Use lsload to examine the utilization of individual nodes

Poor performance
What other jobs are running on my node?
squeue -w NODENAME
Ensure that the # of CPUs requested = # of CPUs used
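
An output format that includes the CPU count per job makes that comparison easier (NODENAME is the same placeholder as above):
squeue -w NODENAME -o "%.18i %.8u %.2t %.10M %.6C"
%C prints the number of CPUs allocated to each job; compare the total against what top shows actually running.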

Poor performance
Log in to the node and run top:
ssh USERNAME@NODENAME
top

Poor performance
Log in to the node and monitor code execution:
ssh USERNAME@NODENAME
ps aux | grep USERNAME   (obtain the relevant process ID)
strace -p PROCESSID
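
If the raw strace stream is too noisy, a per-syscall summary is often more telling:
strace -c -p PROCESSID
The -c flag counts calls and time per system call instead of printing each one; press Ctrl-C to detach and print the summary table.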

Research Computing
Please talk to your peers, and we wish you success in your research!
http://rc.fas.harvard.edu
https://portal.rc.fas.harvard.edu
rchelp@fas.harvard.edu
@fasrc
Harvard Informatics @harvardifx