CNAG Advanced User Training


www.bsc.es CNAG Advanced User Training
Aníbal Moreno, CNAG System Administrator
Pablo Ródenas, BSC HPC Support
Rubén Ramos Horta, CNAG HPC Support
Barcelona, May 5th

Aim
Understand CNAG's cluster design: an HPC cluster is a big and complex machine, unlike your desktop
Understand Slurm scheduling: knowledge of Slurm's internals, plus Slurm tips and tweaks
Understand Lustre internals: a complex network underneath
Who to contact? www.bsc.es/user-support/cnag.php cnag_support@bsc.es

Cluster Overview

Components
The cluster consists of:
2 login nodes
108 GenC nodes (24 cores, 256 GB RAM)
12 GPU nodes (16 cores, 128 GB RAM)
20 GenB nodes (16 cores, 128 GB RAM)
94 GenA nodes (8 cores, 48 GB RAM)
1 SMP node (48 cores, 1 TB RAM)
Data is stored in:
1 NFS-exported /home on all nodes
1 NFS-exported /apps on all nodes
2 high-performance distributed filesystems (Lustre): /project and /scratch
Jobs are executed with:
Slurm resource manager (batch system)

Batch System (Slurm)

Slurm concepts
Scheduling: FIFO with priorities
Backfilling: schedules lower-priority jobs, but only those that fit into currently free resources. Increases utilization of the cluster.
Requires declaring the maximum execution time of jobs: --time on srun, or DefaultTime / MaxTime on the partition
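
As an illustration, a minimal job script that declares its wall-clock limit so the backfill scheduler can slot it into idle gaps; the job name and program here are placeholders, not part of the slides:

#!/bin/bash
#SBATCH --job-name=backfill_demo
#SBATCH --partition=main
#SBATCH --ntasks=1
#SBATCH --time=00:30:00      # declared maximum execution time (30 minutes)

# placeholder workload; the shorter and more accurate the --time, the easier to backfill
srun ./my_program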

Backfill: FIFO scheduler vs. backfill scheduler (diagram)

Backfilling in the cluster
$ sdiag
[...]
Main schedule statistics (microseconds):
  Last cycle:        9773
  Max cycle:         89151
  Total cycles:      1081
  Mean cycle:        6197
  Mean depth cycle:  366
  Cycles per minute: 1
  Last queue length: 423
Backfilling stats
  Total backfilled jobs (since last slurm start):       1388236
  Total backfilled jobs (since last stats cycle start): 1350
  Total cycles:       184
  Last cycle when:    Thu Apr 27 14:19:05 2017
  Last cycle:         14770246
  Max cycle:          24826346
  Mean cycle:         10639957
  Last depth cycle:   416
  Last depth cycle (try sched): 416
  Depth Mean:         1014
  Depth Mean (try depth): 584
  Last queue length:  416
  Queue length mean:  1014

Priorities
Determining factors:
Age: the length of time a job has been waiting in the queue, eligible to be scheduled. This factor reaches its highest value after 7 days.
QoS: a factor assigned to a specific QoS.
Fair-share: the difference between the portion of the computing resources that has been promised and the amount of resources that has been consumed. This factor considers only the last 14 days.
Job size: the number of nodes or CPUs a job is allocated.
Partition: a factor associated with each node partition.

Job_priority = (PriorityWeightAge)       * (age_factor)
             + (PriorityWeightFairshare) * (fair-share_factor)
             + (PriorityWeightJobSize)   * (job_size_factor)
             + (PriorityWeightPartition) * (partition_factor)
             + (PriorityWeightQOS)       * (QOS_factor)
             + SUM(TRES_weight_cpu * TRES_factor_cpu,
                   TRES_weight_<type> * TRES_factor_<type>, ...)

Priorities
Check the cluster weights for each factor:
$ sprio --weights
  JOBID    PRIORITY       AGE  FAIRSHARE  JOBSIZE  PARTITION        QOS
  Weights             5000000   10000000    10000   20000000  100000000

Check the priority of your job that is still in the queue waiting to be scheduled:
$ sprio -j 6486323
  JOBID    PRIORITY       AGE  FAIRSHARE  JOBSIZE  PARTITION        QOS
  6486323  20538776      2290     436465       22   20000000     100000

Final priority calculated from the normalized priority factors:
$ sprio -j 6486327 -n
  JOBID    PRIORITY    AGE        FAIRSHARE  JOBSIZE    PARTITION  QOS
  6486327  0.00478336  0.0014501  0.0437138  0.0022012  1.0000000  0.0010000

PRIORITY = 0.0014501*AGE_weight + 0.0437138*FAIRSHARE_weight + 0.0022012*JOBSIZE_weight
         + 1.0000000*PARTITION_weight + 0.0010000*QOS_weight = 20538776

Partitions
Node states:
  alloc  node allocated to one or more jobs
  mix    node allocated but still with free resources
  idle   node completely free
  drain  node not available
  resv   special reservations

$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE   NODELIST
main*        up     infinite   18     mix     cnc[2,14,36-37,46-59]
main*        up     infinite   13     alloc   cnc[1,3-4,6,38-45,68]
main*        up     infinite   77     idle    cnc[5,15-35,60-67,69-108]
gpu          up     infinite   11     resv    cga[2-12]
gena         up     infinite   17     drain*  cna[21,27-28...67-68]
gena         up     infinite   2      mix     cna[24,49]
gena         up     infinite   6      alloc   cna[26,39,43,50-51,53]
gena         up     infinite   51     idle    cna[11-20,22-23,...,44-46,88]
genb         up     infinite   4      mix     cnb[8,10,12-13]
genb         up     infinite   16     idle    cnb[1-7,9,11,14-20]
interactive  up     infinite   1      resv    cga1
interactive  up     infinite   2      mix     cna[1,8]
interactive  up     infinite   3      alloc   cna[5-7]
interactive  up     infinite   5      idle    cna[2-4,9-10]
smp          up     infinite   1      alloc   smp1
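
If you only care about one partition, sinfo can be restricted and formatted; a small sketch using standard sinfo options, with the partition name taken from the listing above:

# show only the "main" partition, one line per node state
sinfo -p main -o "%P %a %l %D %t %N"

# list just the idle nodes of that partition
sinfo -p main -t idle -o "%D %N"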

Accounting
Limit enforcement: accounting of resource usage by users, used to control access to resources
Fair share
Maximum limits: MaxCPUsPerJob, MaxJobs, MaxNodesPerJob, MaxSubmitJobs, MaxWallDurationPerJob
Quality of Service (QoS): $ sacctmgr show qos
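
To inspect the limits that apply to you, the standard sacctmgr queries below can be used; the columns actually configured on this cluster may differ, so treat the format fields as an illustration:

# list the QoS definitions visible on the cluster (name, priority, wall-clock limit)
sacctmgr show qos format=Name,Priority,MaxWall

# show the accounting associations (and therefore the limits) attached to your user
sacctmgr show assoc user=$USER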

sview
$ ssh -X login1
$ sview

Slurm tips

Job arrays (I)
Useful when a set of jobs share logic but differ in some parameter or input
By slightly modifying your job, instead of launching N jobs you just need to launch a single one
Job script directive: #@ array=1-n

Job arrays (II)
Environment variables for handling your job arrays:
  SLURM_JOB_ID
  SLURM_ARRAY_JOB_ID
  SLURM_ARRAY_TASK_ID
  SLURM_ARRAY_TASK_COUNT
  SLURM_ARRAY_TASK_MAX
  SLURM_ARRAY_TASK_MIN

#!/bin/bash
#SBATCH --job-name=arrayjob
#SBATCH --output=arrayjob_%A_%a.out
#SBATCH --error=arrayjob_%A_%a.err
#SBATCH --array=1-1024
#SBATCH --time=01:00:00
#SBATCH --partition=main
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=4000

# Print this sub-job's task ID
echo "My ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID

# Do some work based on the SLURM_ARRAY_TASK_ID
./my_program $SLURM_ARRAY_TASK_ID

Job arrays (II)
Example values for a 1024-task array (array job 12780): SLURM_ARRAY_JOB_ID is the same for every task, while SLURM_JOB_ID is unique per task.

Job1
  SLURM_JOB_ID=6570001
  SLURM_ARRAY_JOB_ID=12780
  SLURM_ARRAY_TASK_ID=1
  SLURM_ARRAY_TASK_COUNT=1024
  SLURM_ARRAY_TASK_MAX=1024
  SLURM_ARRAY_TASK_MIN=1
Job2
  SLURM_JOB_ID=6570002
  SLURM_ARRAY_JOB_ID=12780
  SLURM_ARRAY_TASK_ID=2
  SLURM_ARRAY_TASK_COUNT=1024
  SLURM_ARRAY_TASK_MAX=1024
  SLURM_ARRAY_TASK_MIN=1
...
Job100
  SLURM_JOB_ID=6570100
  SLURM_ARRAY_JOB_ID=12780
  SLURM_ARRAY_TASK_ID=100
  SLURM_ARRAY_TASK_COUNT=1024
  SLURM_ARRAY_TASK_MAX=1024
  SLURM_ARRAY_TASK_MIN=1
Job1024
  SLURM_JOB_ID=6571024
  SLURM_ARRAY_JOB_ID=12780
  SLURM_ARRAY_TASK_ID=1024
  SLURM_ARRAY_TASK_COUNT=1024
  SLURM_ARRAY_TASK_MAX=1024
  SLURM_ARRAY_TASK_MIN=1
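
A common pattern is to map the task ID to one line of an input list; this is a generic sketch (the file inputs.txt and the program are hypothetical names, not part of the slides):

#!/bin/bash
#SBATCH --job-name=array_inputs
#SBATCH --array=1-100
#SBATCH --time=01:00:00
#SBATCH --ntasks=1

# pick the N-th line of the (hypothetical) input list, where N is this task's index
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" inputs.txt)

./my_program "$INPUT"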

Job arrays (III)
sbatch --array=0-1024        Create a job array
sbatch --array=0-1024%10     Only execute 10 tasks simultaneously from this job array

$ squeue
  JOBID           PARTITION  NAME  USER  ST  TIME  NODES  NODELIST(REASON)
  12780_[5-1024]  debug      tmp   mac   PD  0:00  1      (Resources)
  12780_1         debug      tmp   mac   R   0:17  1      tux0
  12780_2         debug      tmp   mac   R   0:16  1      tux1
  12780_3         debug      tmp   mac   R   0:03  1      tux2
  12780_4         debug      tmp   mac   R   0:03  1      tux3

scancel 12780_[1-3]          Cancel a range of tasks from a job array

Modify your job script once launched and before running
scontrol update jobid=<jobid> <parameter>
  Time limit:  scontrol update jobid=9450064 timelimit=9-00:00:00
  QoS:         scontrol update jobid=9450064 qos=highprio
  Partition:   scontrol update jobid=9450064 partition=main
Prevent a job from being started:
  scontrol hold <jobid>
  scontrol uhold <jobid>
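
Before changing a parameter it helps to check the job's current settings; a small sketch using standard scontrol commands (the job ID from the slide is reused here purely as an example):

# inspect the pending job's current limits
scontrol show job 9450064 | grep -E 'TimeLimit|QOS|Partition'

# lower the time limit so the scheduler can fit the job in earlier
scontrol update jobid=9450064 timelimit=2-00:00:00

# release the job again if it was held
scontrol release 9450064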

Lustre Internals

What is Lustre?
A high-performance parallel distributed filesystem
High performance: fast access
Parallel: many different processes on different computers can access the same file
Distributed: many different servers holding different parts of the same file, working together
Meaning: a Formula 1 car, not a 4WD

Why Lustre?
NFS (Network FileSystem): /home and /apps
  Slow access
  Not parallel
  Not distributed
  All users trying to access resources at the same time: bottleneck

Lustre components
Metadata Servers (MDS): I/O service and network requests for one or more local MDTs
Metadata Targets (MDT): where metadata is actually stored (RAID 1 of SSD disks)
Object Storage Servers (OSS): I/O service and network requests for one or more local OSTs
Object Storage Targets (OST): where data is actually stored (RAID 5+1 SATA2 disks)
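
To see these components behind a mounted Lustre filesystem, the standard lfs client commands can be used; /scratch is one of the Lustre mounts mentioned earlier:

# list the MDTs and OSTs backing /scratch, with their space usage
lfs df -h /scratch

# same information in terms of inodes (useful when a filesystem runs out of inodes rather than space)
lfs df -i /scratch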

Lustre's High Availability (diagram)

Lustre's stripes (diagrams)

Lustre's stripes
Get the striping of a specific folder:
$ lfs getstripe pruebas/
pruebas
stripe_count:  4  stripe_size:  1048576  stripe_offset:  -1
pruebas//1
stripe_count:  4  stripe_size:  1048576  stripe_offset:  -1
pruebas//2
stripe_count:  4  stripe_size:  1048576  stripe_offset:  -1

Set the number of stripes for a specific folder:
  A lot of small files: faster access with just one stripe
  Big files: faster and distributed access with several stripes
$ mkdir pruebas/3
$ lfs setstripe -c 1 pruebas/3
$ lfs getstripe pruebas/3
pruebas/3
stripe_count:  1  stripe_size:  1048576  stripe_offset:  -1

Use carefully! Contact us in case of doubt.
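
Conversely, a directory meant for a few very large files can be spread over several OSTs; a minimal sketch with a hypothetical directory name and stripe settings that should first be discussed with support:

# new files created under big_files/ will be split across 4 OSTs in 4 MB chunks
mkdir pruebas/big_files
lfs setstripe -c 4 -S 4M pruebas/big_files
lfs getstripe pruebas/big_files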

GREASY

GREASY
An easy way of creating dependencies between processes
Each line is a task
Lines starting with # are comments
Line number = task id
Each line: [# optional list of dependencies #] <new_task>
Only backwards dependencies are allowed
More info: github.com/jonarbo/greasy/blob/master/doc/greasy_userguide.pdf

GREASY: Example
# this line is a comment
/bin/sleep 2
# following task id is 4. Task ids 1 and 3 don't exist
/bin/sleep 4
# following task will run after completion of the "sleep 2"
[# 2 #] /bin/sleep 6
# following task will be run after completion of the "sleep 6"
[# -2 #] /bin/sleep 8
# following task is invalid because tasks 1 and 3 do not exist
[# 1-3 #] /bin/sleep 10
# following task will be run after completion of tasks 2, 4, 8
[# 2, 4, -4 #] /bin/sleep 12
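
To actually run such a task file on the cluster, something along these lines is typically used; this is only a sketch that assumes a greasy module and launcher are installed here (file names are placeholders), so check with support for the exact local invocation:

#!/bin/bash
#SBATCH --job-name=greasy_demo
#SBATCH --ntasks=4               # workers available to run independent tasks in parallel
#SBATCH --time=00:30:00
#SBATCH --partition=main

module load greasy               # assumption: greasy is provided as a module on this cluster

greasy tasks.txt                 # tasks.txt is the task file shown above (hypothetical name)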

Data Management

Data management tips
Avoid intermediate steps: use bash pipelines
  ./do_stuff | ./manage_stuff_1 | ... | ./manage_stuff_n > out
Avoid replicating data
Compress as much data as possible
  Standard compression:
    tar -czvf name.tar.gz <path_to_file>
  Parallel compression, much faster (not on the login nodes!):
    tar -cvf - <path_to_file> | pigz -p N_PROC > name.tar.gz
No need to untar every time: zcat, zless, zgrep, zdiff
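
As an illustration of the last point, compressed data can be inspected in place; these are standard tools, and the file names are only placeholders:

# list the contents of a gzipped tarball without extracting it
tar -tzvf name.tar.gz

# look at, search, or compare gzipped files directly
zcat results.txt.gz | head
zgrep "pattern" results.txt.gz
zdiff run1.txt.gz run2.txt.gz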

Improve performance with GPU

GPU nodes: Nvidia Tesla K80
GenC processors: 1.92 TFlops, 68 GB/s of bandwidth
Improve your program's performance with CUDA:
  Nvidia Tesla K80: 2.91 TFlops double-precision performance, 8.73 TFlops single-precision performance, 480 GB/s of bandwidth
Some developers already have a version of their software ready to run on CUDA: ask us to install it
Ask for help if you have an idea for parallelizing your program
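
To run on these nodes, a job has to request the gpu partition and a GPU; a minimal sketch, assuming GPUs are exposed through Slurm's generic resources (--gres) on this cluster and with a placeholder executable:

#!/bin/bash
#SBATCH --job-name=gpu_demo
#SBATCH --partition=gpu          # GPU partition listed in sinfo above
#SBATCH --gres=gpu:1             # assumption: one GPU requested via generic resources
#SBATCH --time=01:00:00
#SBATCH --ntasks=1

./my_cuda_program                # placeholder CUDA executable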

Major incidents

Running out of inodes
One user created 80 million files in scratch
Unexpected use of the cluster
Hard to monitor everything
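
A way to keep an eye on this yourself is to check how many files (inodes) you own on the Lustre filesystems; a sketch using standard lfs commands, assuming user quotas are enabled on /scratch:

# your space and inode usage on /scratch (the "files" column is the inode count)
lfs quota -u $USER /scratch

# overall inode usage per MDT/OST
lfs df -i /scratch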

Major incidents
Sending too many jobs
  Use job arrays instead
  Affects Slurm's scheduling; the cluster runs out of resources
  Max 100000 jobs to be scheduled at the same time
  Max 3000 jobs to be backfilled at the same time
  If the limit is exceeded, no other jobs will be scheduled until the queue has emptied
Wall-clock accuracy
  Smarter use of resources
  Allows more jobs to be scheduled
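
To tune your requested wall clock against reality, compare the time a finished job actually used with what it asked for; a sketch using standard sacct fields (the job ID is a placeholder):

# compare the requested time limit with the elapsed run time of a finished job
sacct -j 12780 --format=JobID,JobName,Partition,Timelimit,Elapsed,State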

Q&A

www.bsc.es
Thank you!
For further information please contact cnag_support@bsc.es