Boost your efficiency when dealing with multiple jobs on the Cray XC40 supercomputer Shaheen II Samuel KORTAS KAUST Supercomputing Laboratory KSL Workshop Series June 5th, 2016
Agenda A few tips when dealing with numerous jobs The Slurm way (up to a limit) Four KSL tools to take you further Breakit (1 to 10000s, all the same) KTF (1 to 100, tuned) Maestro (1 to 1000s, programmed) Decimate (dependent jobs) Hands-on session: /scratch/tmp/ksl_workshop Documentation on hpc.kaust.edu.sa/1001_jobs (to be completed today) Conclusion
Launching thousands of jobs Some of our users use Shaheen for parameter sweeps involving thousands of jobs, saving thousands of temporary files. They need a result in a guaranteed time. They are not HPC experts, but their problems are challenging in terms of scheduling and filesystem stress. They implement complex workflows sending the output of one code into the input of others, producing a lot of small files.
Scheduling thousands of jobs KSL does its best, but it's not that easy, folks! The Tetris game gets rough with long rectangles ;-( (time axis x 1000s of jobs, 6144 nodes available)
Let's help the scheduler! (1/5) Set the right elapsed time: an accurate estimate lets Slurm backfill your job sooner.
Let's help the scheduler! (2/5) Let's share resources better among us The current scheduler policy is first come, first served Your priority increases as long as you are waiting 'actively' in the queue; held or dependent jobs are not counted Slurm takes your backfilling potential into account But we have to share, guys: the number of jobs in the queue is limited Slurm's fair-share implementation is reported to work well only with a small number of projects
Let's help the scheduler! (3/5) Let's lower the stress on the filesystem Each one of the 1000s of jobs may need to read, probe or write a file. We have a single filesystem shared by all the jobs; let's spare it Lustre is not tuned for small files Let's use the ramdisk when possible and save only the data that matters to Lustre (see next slide) Let's communicate in memory instead of via files Let's choose the right stripe count
Let's help the scheduler! (4/5) How to use the ramdisk? On each Shaheen II compute node, /tmp is a ramdisk: a POSIX filesystem hosted directly in memory Starting at 64 GB, it shrinks as your program uses more and more memory: an additional memory request or a write to /tmp fails when size(OS) + size(program instructions) + size(program variables) + size(/tmp) > 128 GB Still, /tmp is the fastest filesystem of all (compared to Lustre and DataWarp) But it is distributed (one per node) and lost at the end of the job: think of storing temporary files in /tmp and saving them at the end of the job; think of storing frequently accessed files in /tmp
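A sketch of this pattern inside a job script: work in the node-local ramdisk, then copy back only the results that matter. All paths and file names below are illustrative (in a real job the same logic would sit between the #SBATCH header and the end of the script).

```shell
#!/bin/bash
# Work in the node-local /tmp ramdisk, then save what matters to Lustre.
TMP_RUN=${TMPDIR:-/tmp}/myrun.$$       # per-run scratch in the ramdisk
SAVE_DIR=${SAVE_DIR:-$PWD}/results     # Lustre destination (illustrative)
mkdir -p "$TMP_RUN" "$SAVE_DIR"

# Stands in for the real computation writing its temporary files:
echo "fast temporary data" > "$TMP_RUN/out.dat"

cp "$TMP_RUN"/out.dat "$SAVE_DIR"/     # keep only the results that matter
rm -rf "$TMP_RUN"                      # /tmp content is lost at job end anyway
```
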
Let's help the scheduler! (5/5) Off-loading the CDL (login) nodes onto the compute nodes You may need to pre/postprocess, monitor a job, relaunch it, get notified when it starts or ends... Automate all this and move the load from the CDL nodes to the compute nodes Use #SBATCH --mail-user Use breakit, ktf, maestro, decimate Ask the KSL team for help: it's only a script away
Managing 1001 jobs 1 - the SLURM way submitting Arrays...
Slurm Way (1/3) Slurm can submit and manage collections of similar jobs easily: job arrays To submit a 500-element job array: sbatch --array=1-500 -N1 -i my_in_%a -o my_out_%a job.sh where %a in a file name is mapped to the array task ID (1-500) squeue -r --user=<my_user_name> 'unfolds' jobs queued as a job array More info at http://slurm.schedmd.com/job_array.html
Slurm Way (2/3) Job environment variables: inside each task, SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID identify the array and the element squeue and scancel, plus some scontrol options, can operate on an entire job array or on selected task IDs squeue's -r option prints each task ID separately
Slurm Way (3/3) Job example Possible commands: sbatch --array=1-16 my_job sbatch --array=1-500%20 my_job (the %20 allows at most 20 tasks of the array to run at a given time) Taken from https://rcc.uchicago.edu/docs/running-jobs/array/index.html
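The job script itself appeared as a screenshot on this slide; a minimal sketch of such an array job might look as follows (job name, program name and resource limits are illustrative, not taken from the original):

```shell
#!/bin/bash
#SBATCH --job-name=sweep          # illustrative name
#SBATCH --array=1-16
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --output=my_out_%a        # %a expands to the array task ID

# Each element of the array reads its own input file:
srun ./my_prog my_in_${SLURM_ARRAY_TASK_ID}
```

This is a batch-script fragment: it only runs when submitted with sbatch on a Slurm system.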
Slurm Way But... Slurm counts each element of the array as a job per se: as of now, the total number of jobs in the queue is limited to 800 per user Pending jobs do not gain priority Only one parameter can vary: if you need to sweep several parameters, the script itself has to deduce them from the array index...
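A common workaround for the one-parameter limit is to linearize the sweep: encode every combination of parameters into the single array index and let the script decode it. A minimal sketch (the parameter names and set sizes are made up; SLURM_ARRAY_TASK_ID is given a default only so the snippet can be tried outside a job):

```shell
#!/bin/bash
# Decode one array index into two sweep parameters.
TASK=${SLURM_ARRAY_TASK_ID:-7}    # set by Slurm inside a real job
N_PRESSURE=4                      # hypothetical size of the 2nd parameter set
TEMP_IDX=$(( (TASK - 1) / N_PRESSURE ))
PRES_IDX=$(( (TASK - 1) % N_PRESSURE ))
echo "task $TASK -> temperature index $TEMP_IDX, pressure index $PRES_IDX"
```

With --array=1-12 this covers a 3 x 4 grid of (temperature, pressure) combinations from a single varying index.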
Slurm Way hands-on Submit the job /scratch/tmp/ksl_workshop/slurm/demo.job as an array of 20 occurrences; check the script, its output and the queue; then cancel it.
Slurm Way hands-on solution Submit the job /scratch/tmp/ksl_workshop/slurm/demo.job: sbatch --array=1-20 /scratch/tmp/ksl_workshop/slurm/demo.job Check the script, its output and the queue: squeue -r --user=<my_user> Cancel it: scancel -n <my_job_name>
Managing 1001 jobs? 4 KSL open source Tools
Why? Ease your life and centralize some common developments breakit, ktf, maestro: available on Shaheen as modules decimate: under development for 2 PIs, released soon on bitbucket.org Soon available at https://bitbucket.org/kaust_ksl/ (GNU GPL License) Written in Python 2.7 Installed on Shaheen II, portable to workstations and Noor Our goal: hiding complexity All share a common API and internal library engine, also available on bitbucket.org/kaust_ksl Maintained by KSL (samuel.kortas (at) kaust.edu.sa)
Managing 1001 jobs Using the breakit wrapper
Breakit (1/3) Idea and status To let you cope seamlessly with the limit of 800 jobs in the queue No need to change your job array Breakit automatically monitors the process for you Version 0.1: I need your feedback!
Slurm way How to handle it with Slurm? You, or a program running on the CDL node, have to keep resubmitting while staying below the max number of jobs allowed in the queue.
Breakit (2/3) How does it work? breakit submits the first chunk of the array, up to the max number of jobs allowed in the queue, then exits: breakit is not active anymore! As the jobs of the chunk start, they submit the next chunk of jobs with a dependency on the current one. When the first chunk's jobs are done, the dependency is resolved and the next jobs become pending; they in turn submit the following chunk with a dependency.
Breakit (2/3) How does it work? Instead of submitting all the jobs at once, they are submitted in chunks. Chunk #n is running or pending; chunk #n+1 depends on chunk #n and starts only when every job of chunk #n has completed, then submits chunk #n+2 with a dependency on chunk #n+1. We did offload some work from the CDL node onto the compute nodes ;-)
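The chunking idea can be pictured with plain sbatch commands. The dry run below only prints what a chunked submission could look like; it is an illustration of the principle, not breakit's actual implementation (the job name is made up and the dependency job ID is a placeholder, since real IDs are only known after submission):

```shell
#!/bin/bash
# Dry run: print sbatch commands submitting a 100-task array in chunks
# of 16, each chunk held by a dependency on completion of the previous one.
TOTAL=100; CHUNK=16; DEP=""; N_CHUNKS=0
for start in $(seq 1 $CHUNK $TOTAL); do
  end=$(( start + CHUNK - 1 ))
  [ "$end" -gt "$TOTAL" ] && end=$TOTAL
  echo "sbatch${DEP:+ $DEP} --array=${start}-${end} my.job"
  DEP="--dependency=afterok:<jobid_of_previous_chunk>"   # placeholder id
  N_CHUNKS=$(( N_CHUNKS + 1 ))
done
```
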
Breakit (3/3) How to use it? 1) Load the breakit module: module load breakit man breakit (to be completed) breakit -h 2) Launch your job: breakit --job=your.job --array=<nb_of_jobs> --chunk=<max_nb_of_jobs_in_queue> 3) Manage it: squeue -r -u <user> -n <job_name> scancel -n <job_name>
Breakit Hands on Via breakit, submit an array of 100 occurrences of the job /scratch/tmp/breakit/demo.job, with only 16 jobs simultaneously in the queue
Breakit Hands on (solution) Via breakit, submit an array of 100 occurrences of the job /scratch/tmp/breakit/demo.job, with only 16 jobs simultaneously in the queue: module load breakit breakit --job=/scratch/tmp/breakit/demo.job --range=100 --chunk=16
Breakit Next steps Find a better name! Support all array range (not only 1-n) Provide an easy restart Provide an easier way to kill jobs
Managing 101 jobs Using KTF
KTF Idea At a certain point, you may need to evaluate the performance of a code under different conditions, or to run a parametric study: the same executable is run several times with a different set of parameters: physical values characterizing the problem; number of processors, threads and/or nodes; compiler used; compiling options; parameters passed on the srun command line to experiment with different placement strategies. KTF (Kaust Test Framework) can help you with this!
What is KTF? KTF (Kaust Test Framework) has been designed and used during the Shaheen II procurement to ease the generation, submission, monitoring and result collection of a set of jobs depending on a set of parameters to explore. Written in Python 2.7 Self-contained and portable Available on bitbucket.org/kaust_ksl/ktf
How does KTF work? A few definitions An 'experiment' A 'case' is one single run of this experiment with a given set of parameters A 'test' gathers a number of cases
How does KTF work? KTF relies on A centralized file listing all combinations of parameters to address, i.e. shaheen_cases.ktf A set of template files, where the parameters need to be replaced before submission in all files ending in .template
KTF hands-on! (1/) Initialize the environment 1) Load the environment and check that ktf is available: module load ktf man ktf ktf -h 2) Create and initialize your working directory: mkdir <my_test_dir> cd <my_test_dir> ktf --init You should get a ktf-like tree structure with some examples of centralized case files and associated templates 3) Examine the case file shaheen_cases.ktf, understand the ktf syntax, modify parameters and check your change by listing all the combinations: ktf --exp
KTF Centralized case file (see file shaheen_zephyr0.ktf) A line starting with # is a comment, not parsed by KTF. The first line gives the names of the parameters; Case and Experiment are absolutely mandatory. Each following line is a test case, setting a value for EACH parameter. According to this case file, for the third test case, in each file ending in .template: Case will be replaced by 128, Experiment by zephyr/strong, NX by 255, NY by 255, NB_CORES by 128, ELLAPSED_TIME by 0:05:00
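Reconstructed from the description above, the case file could look like this. Only the third data line is taken from the slide; the first two lines and the column spacing are made up for illustration, and the exact KTF syntax may differ from what is shown here:

```
# lines starting with '#' are comments, not parsed by KTF
Case   Experiment     NX    NY    NB_CORES   ELLAPSED_TIME
32     zephyr/strong  255   255   32         0:05:00
64     zephyr/strong  255   255   64         0:05:00
128    zephyr/strong  255   255   128        0:05:00
```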
KTF initial directory structure: a subdirectory containing files common to all the experiments, one directory per experiment, and the default case file.
KTF template files (see files in tests/zephyr/strong/): job.shaheen.template is the job script template, running ./zephyr input; input.template is the template of the input file. Both contain parameter names to be substituted.
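A sketch of what job.shaheen.template might contain, given that KTF substitutes the bare parameter names before submission. The header lines and the srun invocation are assumptions for illustration, not taken from the actual file:

```
#!/bin/bash
#SBATCH --job-name=zephyr_Case
#SBATCH --ntasks=NB_CORES
#SBATCH --time=ELLAPSED_TIME

srun -n NB_CORES ./zephyr input
```

After processing the third case, NB_CORES would read 128 and ELLAPSED_TIME would read 0:05:00.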
KTF commands ktf... --help: get help on the command line --init: initialize the environment, copying example .template and .ktf files --build: generate all combinations listed in the case file --launch: generate all combinations listed in the case file and submit them --exp: list all combinations present in the case .ktf file --monitor: monitor all the experiments and display all results in a dashboard --kill: kill all jobs related to this ktf session --status: list all date stamps and cases of the experiments made or currently occurring
KTF hands-on! (2/) Prepare a first experiment 4) Examine the case file shaheen_cases.ktf, understand the ktf syntax, modify parameters and check your change by listing all the combinations: ktf --exp 5) Build an experiment and check that the templated files have been well processed: ktf --build should create one tests_ directory: tests_shaheen_<date>_<time>
KTF directory after --build: the initial templates are instantiated for each case (e.g. the third case).
KTF directory after --launch: zephyr is copied from the common directory, job.shaheen is processed from job.shaheen.template, and input from input.template.
KTF Centralized case file Handling constant parameters File shaheen_zephyr0.ktf is strictly identical to file shaheen_zephyr1.ktf, where a #KTF pragma declares new parameters that keep the same value ever after.
Another example of a KTF case file, mixing several values of the Experiment parameter.
KTF filters and flags ktf --xxx... --case-file=<case file>: use another case file than shaheen_cases.ktf --what=zzzz: filter on some cases --reservation=<reservation name>: submit within a reservation Examples: ktf --exp --what=128 ktf --launch --what=64 --reservation=workshop ktf --exp --case-file=shaheen_zephyr1.ktf
KTF filters and flags ktf --xxx... --ktf-file=<case file>: use another case file than shaheen_cases.ktf --what=zzzz: filter on some cases --when=yyyy --today --now: filter on some date stamps --times=<nb>: repeat the submission <nb> times --info: switch on informative traces --info-level=[0|1|2|3]: change the informative trace level --debug: switch on debugging traces --debug-level=[0|1|2|3]: change the debugging trace level
KTF hands-on! (3/) Playing with the what filter 4) Examine the case file shaheen_cases.ktf, understand the ktf syntax, modify parameters and check your changes by listing all the combinations, with or without filtering, and using other case files: ktf --exp ktf --exp --what=<your filter> ktf --exp --case-file=shaheen_zephyr1.ktf 5) Build an experiment and check that the templated files have been well processed: ktf --build ktf --build --what=<your filter> should create two tests_ directories under the directory from where you call ktf: tests_shaheen_<date>_<time>
KTF hands-on! (4/) Launch and monitor our first experiment 6) Build an experiment and submit it: ktf --launch [ --reservation=workshop ] should create a new tests directory and spawn the jobs in ./tests_shaheen_<date>_<time> ktf --monitor will monitor your current ktf session; check what shows up in the R/ directory 7) Play with repeating experiments and filtering results: ktf --launch --what=<your filter> [ --reservation=workshop ] ktf --launch --times=5 [ --reservation=workshop ] ktf --monitor ktf --monitor --what=<your case filter> --when=<your date filter> check what shows up in the R/ directory
KTF results dashboard Reading the result dashboard: % ktf --monitor The dashboard shows, for each case, when it ran and what was run, its status and time, a '!' if job.err is not empty, a mark when a job is not yet finished, and the submission directory linked in R/.
KTF R/ directory quick access to results The R/ directory is updated each time you call ktf --monitor. It builds symbolic links to the result directories in order to give you quick access to the results you want to check.
KTF R/ directory quick access to the results directory
KTF results configuration Implementation and default printing In fact: alias ktf='python run_test.py' alias ki='python run_test.py --init' alias km='python run_test.py --monitor' run_test.py encodes the value to be displayed in the dashboard (printed when calling --monitor). By default, it is <ellapsed time taken by the whole test>/<status of the test>, with a '!' after the status if ever job.err is not empty, and a '!' before the status if ever the job did not terminate properly. Remember you can use cat, more or tail on R/*/job.err to scan all these files!
KTF results configuration Changing the default printing But you can change the displayed values at will and adapt them to your own needs: other values (Flops, intermediate results, total number of iterations, convergence rate); several values (<flops>/<time>/<status>); other events to trigger the '!' sign; other typographic signs. How to do it:
KTF run_test.py file
KTF hands-on! (5/) Modifying the printed result 8) Check what ktf prints: ktf --monitor and understand how run_test.py works 9) Modify run_test.py in order to print the time per iteration
KTF Next steps Gather tests into campaigns Have a better display: --monitor option, web interface, automated generation of plots Enrich the filtering feature: regular expressions, several filters possible Enable coding capability inside the case file Complete the documentation Save results into a database and be able to compute statistics Cover the compiling step
KTF Next steps Support clean and campaigns Chain several jobs into one Support job arrays, dependencies, mail to user Port to Noor and workstations Offload from workstation to Shaheen Better versioning of the template files Provide one ktf initial environment per science field
Managing 1001 jobs using Maestro
Maestro principles (1/2) Handling these studies should be the same on: a Linux box Shaheen, Noor, Stampede a laptop under Windows or macOS a given set of Linux boxes The only prerequisites: Python > 2.4 and MPI on a supercomputer Python > 2.4 on a workstation
Maestro principles (2/2) Minimal or no knowledge of HPC environment required Easy management of the jobs handled as a whole.
A set of tools adapted to a distributed execution (1/3) No pre-installation needed on the machines: maestro is self contained Easy and quick prototyping on workstation with immediate porting on supercomputer Global Error signals easy to throw and trace Global handling of the jobs as a whole study (launching, monitoring, killing and restarting through one command)
A set of tools adapted to a distributed execution (2/3) All the flexibility of Python available to the user in a distributed environment (class inheritance, modules): production of robust, easy-to-read code, with an explicit error stack in case of a problem to debug Transparent replication of the environment on each of the compute nodes Work in the /tmp of each compute node to minimize the stress on the filesystem
A set of tools adapted to a distributed execution (3/3) Extended Grep (multi-line, multi-column, regular expressions) to postprocess the output files Centralized management of the template to replace Global selection of files to be kept and parametrization of the receiving directory A console to explore easily subdirectories where results are saved Each running process can write in a same global file
Maestro Principles maestro allocates a pool of nodes and runs elementary jobs in it.
An example (annotated script): the file to save, the directory name where results are saved, the elementary computation, the sending of local and global messages, a parametrized Z range, and the definition of the domain to sweep.
Command line options <no option>: classical sequential run on 1 core, stopping at the first error encountered --cores=<n>: parallel run on n cores --depth=<p>: partial parallelisation up to level p --stat: live status of the ongoing computation --reservation=<id>: run inside a reservation --time=hh:mm:ss: set the elapsed duration of the overall job --kill: kill the ongoing computation and clean the environment --resume: resume a computation --restart: restart a computation from scratch --help: help screen
Demo!
Next Steps Allowing maestro to launch multicore jobs More clever sweeping algorithms (decime project) Support of a given set of workstations Coupling maestro with a website Remote launching and dynamic off-loading from workstation to supercomputer
Managing dependent jobs in complex workflows Using Decimate
Idea Some workflows involve several steps depending on one another: several jobs with a dependency between them. Some intermediate steps may break; the dependency then breaks, and the workflow remains idle, requesting an action. We want to automate this.
What is decimate? A tool in Python written for two different PIs with the same need: launch, monitor and heal dependent jobs; make things automated and smooth.
What is decimate? Add-ons Centralized log files Global resume, --status and kill commands Sends a mail at any time to keep the user updated Can make a decision when a dependency is broken: relaunch the same job again and fix the dependency; change input data, relaunch and fix the dependency; cancel only this job and move on; cancel the whole workflow.
Some examples of workflows
Conclusion We have presented some useful tools to handle many jobs at a time:

                 slurm          breakit        ktf        maestro    decimate
Typical # jobs   < 800          > 800          100        1-1000     ?
Jobs are         same           same           different  different  different
Parameters       1              1              several    many       any
#nodes/job       same           same           any        same       any
Dependent        one at a time  one at a time  no         no         yes

Your feedback is needed! help@hpc.kaust.edu.sa