Duke Compute Cluster Workshop. 11/10/2016 Tom Milledge https://rc.duke.edu/

Duke Compute Cluster Workshop 11/10/2016 Tom Milledge https://rc.duke.edu/ rescomputing@duke.edu

Outline of talk
Overview of Research Computing resources
Duke Compute Cluster overview
Running interactive and batch jobs
Using Slurm job arrays
Running multi-core and parallel jobs
Specifying job dependencies
Live demo and questions

The Duke Compute Cluster (DCC)
Formerly known as the DSCR
590 compute nodes with 7618 CPU cores (as of May 2016)
Most nodes are purchased by labs and departments; some are provided by the University
500 TB of primary storage (Isilon X-series)
Red Hat Enterprise Linux 6
Uses the Slurm job queueing system

Accessing the Duke Compute Cluster
From on-campus (University or DUHS networks): ssh dscr-slogin-01.oit.duke.edu (or dscr-slogin-02)
From off campus, first ssh login.oit.duke.edu and then ssh to an slogin node
Alternatively, use the Duke VPN: https://oit.duke.edu/net-security/network/remote/vpn/
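
For example, a minimal sketch of the off-campus two-hop login (assuming your Duke NetID is netid):
# from off campus, hop through the general-purpose login host first
ssh netid@login.oit.duke.edu
# then, from login.oit.duke.edu, continue to a DCC login node
ssh dscr-slogin-01.oit.duke.edu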

Copying files and directories
Use scp or rsync for Linux or Mac workstations; use WinSCP for Windows: https://winscp.net
Copying a file to the DCC ("push"):
% scp data001.txt netid@dscr-slogin-02.oit.duke.edu:.
Copying a file from the DCC ("pull"):
% scp netid@dscr-slogin-02.oit.duke.edu:output.txt .
For directories, use either scp -r (small files) or rsync -av (large files)
Pushing a directory:
rsync -av dir1/ netid@dscr-slogin-01.oit.duke.edu:.
or
scp -r dir1/ netid@dscr-slogin-01.oit.duke.edu:.
Pulling a directory:
rsync -av netid@dscr-slogin-01.oit.duke.edu:~/dir1 .
or
scp -r netid@dscr-slogin-01.oit.duke.edu:~/dir1 .

Duke Compute Cluster file systems
/dscrhome (a symlink to /hpchome/group)
Primary storage on the Isilon X-series filers
250 GB group quota (typical)
Two-week tape backup (TSM)
/work
Temporary storage for large data files
Place for temporary files associated with running jobs (I/O)
Not backed up
Subject to purges based on file age and/or utilization levels
200 TB total volume size, unpartitioned
/scratch
File storage on the local node file system
Archival storage available for $92/TB/year

Slurm resources
The DSCR wiki, "SLURM Queueing System": https://wiki.duke.edu/display/SCSC/SLURM+Queueing+System
Official SLURM docs: http://schedmd.com/slurmdocs
Older SLURM documentation: https://computing.llnl.gov/linux/slurm/slurm.html
Comes up a lot in Google searches but is outdated; use schedmd.com instead

Running an interactive job
Reserve a compute node by typing srun --pty bash -i

tm103@dscr-slogin-02 ~ $ srun --pty bash -i
srun: job 186535 queued and waiting for resources
srun: job 186535 has been allocated resources
tm103@dscr-encode-11 ~ $
tm103@dscr-encode-11 ~ $ squeue -u tm103
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
186535 common bash tm103 R 0:14 1 dscr-encode-11

I now have an interactive session in the common partition on node dscr-encode-11

Slurm jobs run in partitions
Most DCC partitions are dept-owned machines; these can only be used by members of the group
Slurm partitions can overlap
Submit to partitions with --partition= or -p, e.g.
#SBATCH -p (partition name)   (in a script)
or
srun -p (partition name) --pty bash -i   (interactively)
The default DCC partition is called common; the common partition includes most nodes
Submitting to a group partition gives high priority
A high-priority job can preempt (force-suspend) common partition jobs
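
As an illustration, a minimal batch-script sketch that targets a group partition (the partition name mygrouplab is hypothetical; substitute your own group's partition, or common):
#!/bin/bash
#SBATCH -p mygrouplab           # hypothetical group partition; use -p common for the shared pool
#SBATCH --output=partition-test.out
uname -n                        # print which node the job landed on
Submit it with sbatch, e.g. sbatch partition-test.q (the file name is arbitrary).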

GPU nodes
Use -p gpu-common (RHEL6) or -p gpu7-common (RHEL7)

tm103@dscr-slogin-02 ~ $ srun -p gpu-common --pty bash -i
tm103@dscr-gpu-01 ~ $ /usr/local/cuda-7.5/samples/1_Utilities/deviceQuery/deviceQuery
/usr/local/cuda-7.5/samples/1_Utilities/deviceQuery/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Tesla K80"
CUDA Driver Version / Runtime Version: 7.5 / 7.5
CUDA Capability Major/Minor version number: 3.7
Total amount of global memory: 11520 MBytes (12079136768 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
...
tm103@dscr-slogin-02 ~ $ srun -p gpu7-common --pty bash -i
tm103@dscr-gpu-21 ~ $

SLURM commands
sbatch: submit a batch job (like qsub)
#SBATCH: specify job parameters in a script (like #$)
squeue: show lists of jobs (like qstat)
scancel: delete one or more batch jobs (like qdel)
sinfo: show info about machines (like qhost)
scontrol: show cluster configuration information
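
A few usage examples (a sketch; the job ID 186554 and the script name test.q are placeholders):
sbatch test.q             # submit the batch script test.q
squeue -u $USER           # list your own pending and running jobs
scancel 186554            # cancel job 186554
sinfo                     # show partitions and node states
scontrol show job 186554  # detailed information about one job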

sbatch
Use sbatch (all lower case) to submit: sbatch test.q
Use #SBATCH (upper case) in your scripts for scheduler directives, e.g.
#SBATCH --mem=1g
#SBATCH --output=matlab.out
All SLURM directives can be given on the command line instead of the script.
http://slurm.schedmd.com/sbatch.html
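
For example, the same two directives given as command-line options instead of #SBATCH lines (a sketch using the test.q script from above):
sbatch --mem=1g --output=matlab.out test.q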

sbatch example
#!/bin/bash
#
#SBATCH --output=test.out
uname -n    # print hostname

This prints the name of the compute node in the file "test.out"

tm103@dscr-slogin-02 ~/slurm $ sbatch simple.q
Submitted batch job 186554
tm103@dscr-slogin-02 ~/slurm $ cat test.out
dscr-compeb-14

Long-form commands example
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mem=100            # 100 MB RAM
#SBATCH --partition=abyss    # e.g.
uname -n 1>&2                # prints hostname to the error file

For a user in the abyss group, this job will run in high priority on an abyss lab node.

Short-form commands example
SLURM short commands don't use "=" signs
#!/bin/bash
#SBATCH -o slurm.out
#SBATCH -e slurm.err
#SBATCH --mem=4g             # 4 GB RAM
#SBATCH -p randleslab        # only for members
uname -n 1>&2                # prints hostname to the error file

Matlab example script
#!/bin/bash
#SBATCH -J matlab
#SBATCH -o slurm.out
#SBATCH --mem=4g             # 4 GB RAM
/opt/apps/matlabR2015a/bin/matlab -nojvm -nodisplay -singleCompThread -r my_matlab_program

The -singleCompThread option is required to prevent uncontrolled multithreading

Slurm memory directives
This is a hard limit; always request a little more
--mem=<MB>: the amount of memory required per node
--mem-per-cpu=<MB>: the amount of memory per CPU core (for multi-threaded jobs)
Note: --mem and --mem-per-cpu are mutually exclusive
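
A sketch with hypothetical values, showing the two styles (use one or the other, not both):
# whole-job style: 8 GB for the job, regardless of core count
#SBATCH --mem=8g
# per-core style: 4 cores at 2 GB each, 8 GB in total
#SBATCH -c 4
#SBATCH --mem-per-cpu=2g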

Slurm parallel directives
All parallel directives have defaults of 1
-N <number>: how many nodes (machines)
-n <number> or --ntasks=<number>: how many parallel jobs ("tasks")
-c, --cpus-per-task=<ncpus>: use -c for multi-threaded jobs
The --ntasks default is one CPU core per task, but the --cpus-per-task option will change this default.
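
A sketch of how these directives combine (the numbers are hypothetical): 2 nodes, 4 tasks, and 2 CPU cores per task, i.e. 8 cores in total:
#SBATCH -N 2   # 2 nodes (machines)
#SBATCH -n 4   # 4 tasks, e.g. MPI ranks
#SBATCH -c 2   # 2 CPU cores per task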

Multi-threaded (multi-core) example
#!/bin/bash
#SBATCH -J test
#SBATCH -o test.out
#SBATCH -c 4
#SBATCH --mem-per-cpu=500    # (500 MB)
myapplication -n $SLURM_CPUS_PER_TASK

The value of $SLURM_CPUS_PER_TASK is the number after -c
This example starts a single, multi-threaded job that uses 4 CPU cores and 2 GB (4 x 500 MB) of RAM

OpenMP multicore example
#!/bin/bash
#SBATCH -J openmp-test
#SBATCH -o slurm.out
#SBATCH -c 4
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
myopenmpapp                  # will run on 4 CPU cores

This sets $OMP_NUM_THREADS to the value of $SLURM_CPUS_PER_TASK

Slurm job arrays
A mechanism for submitting and managing collections of similar jobs
Job arrays are only supported for batch jobs, and the array index values are specified using the --array or -a option
http://slurm.schedmd.com/job_array.html
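
For reference, a small sketch of common --array forms (the ranges and file name are arbitrary examples); in output file names, %A expands to the array's master job ID and %a to the task index:
#SBATCH --array=1-100              # tasks 1, 2, ..., 100
#SBATCH --array=0-90:10            # tasks 0, 10, 20, ..., 90 (step of 10)
#SBATCH --array=1-100%10           # at most 10 tasks running at once
#SBATCH --output=slurm-%A_%a.out   # one output file per array task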

Matlab job array example
#!/bin/bash
#SBATCH --output=matlab.out
#SBATCH --array=1-100
#SBATCH --mem=4g
/opt/apps/matlab/R2012b/bin/matlab -nojvm -nodisplay -singleCompThread -r "rank=$SLURM_ARRAY_TASK_ID;my_prog;quit"

Start 100 Matlab programs, each with a different rank, e.g. 1, 2, ... 100

Python job array example
#!/bin/bash
#SBATCH -e slurm.err
#SBATCH --array=1-5000
python mycode.py

$ cat test.py
import os
rank=int(os.environ['SLURM_ARRAY_TASK_ID'])
...

Start 5000 Python jobs, each with a different rank, initialized from SLURM_ARRAY_TASK_ID

Running MPI jobs
Supported MPI versions:
- Intel MPI
- OpenMPI
- LAM MPI
SLURM MPI jobs use --ntasks=(num)
https://wiki.duke.edu/display/SCSC/Running+MPI+Jobs

Compiling with OpenMPI
tm103@dscr-slogin-02 ~ $ export PATH=/opt/apps/slurm/openmpi-2.0.0/bin:$PATH
tm103@dscr-slogin-02 ~ $ which mpicc
/opt/apps/slurm/openmpi-2.0.0/bin/mpicc
tm103@dscr-slogin-02 ~ $ mpicc -o openhello /dscrhome/tm103/misc/slurm/openmpi/hello.c
tm103@dscr-slogin-02 ~ $ ls -l openhello
-rwxr-xr-x. 1 tm103 scsc 9184 Sep 1 16:08 openhello

OpenMPI job script
#!/bin/bash
#SBATCH -o openhello.out
#SBATCH -e openhello.err
#SBATCH -n 20
export PATH=/opt/apps/slurm/openmpi-2.0.0/bin:$PATH
mpirun -n $SLURM_NTASKS openhello

OpenMPI example output
tm103@dscr-slogin-02 ~/misc/slurm/openmpi $ cat openhello.out
dscr-core-01, rank 0 out of 20 processors
dscr-core-01, rank 1 out of 20 processors
dscr-core-01, rank 2 out of 20 processors
dscr-core-01, rank 3 out of 20 processors
dscr-core-01, rank 4 out of 20 processors
dscr-core-03, rank 13 out of 20 processors
dscr-core-03, rank 14 out of 20 processors
dscr-core-03, rank 10 out of 20 processors
dscr-core-03, rank 11 out of 20 processors
dscr-core-03, rank 12 out of 20 processors
dscr-core-02, rank 8 out of 20 processors
dscr-core-02, rank 9 out of 20 processors
dscr-core-02, rank 5 out of 20 processors
...

Intel MPI example
#!/bin/bash
#SBATCH --ntasks=20
#SBATCH --output=intelhello.out
export I_MPI_PMI_LIBRARY=/opt/slurm/lib64/libpmi.so
source /opt/apps/intel/intelvars.sh
srun -n $SLURM_NTASKS intelhello

Job dependencies
https://hcc-docs.unl.edu/display/HCCDOC/Job+Dependencies
Start job dep2 after job dep1:
$ sbatch dep1.q
Submitted batch job 666898
Make a note of the assigned job ID of dep1
$ sbatch --dependency=afterok:666898 dep2.q
Job dep2 will not start until dep1 finishes
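
To avoid copying the job ID by hand, a small scripted sketch using sbatch's --parsable option, which prints just the job ID:
jid=$(sbatch --parsable dep1.q)           # capture dep1's job ID
sbatch --dependency=afterok:$jid dep2.q   # dep2 starts only if dep1 finishes successfully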

Job dependencies with arrays
Wait for specific job array elements:
sbatch --depend=after:123_4 my.job
sbatch --depend=afterok:123_4:123_8 my.job2
Wait for entire job array to complete:
sbatch --depend=afterany:123 my.job
Wait for entire job array to complete successfully:
sbatch --depend=afterok:123 my.job
Wait for entire job array to complete with at least one task failing:
sbatch --depend=afternotok:123 my.job

Live demo notes
df -h
srun --pty bash -i
squeue | more
squeue -S S
squeue -S S | grep -v PD
squeue -u (NetID)
sbatch (job script)
scancel (job id)
uname -n, sleep, top
sinfo | grep -v common
scontrol show node (node name)
scontrol show job (job id)