Boost your efficiency when dealing with multiple jobs on the Cray XC40 supercomputer Shaheen II. KAUST Supercomputing Laboratory KSL Workshop Series
1 Boost your efficiency when dealing with multiple jobs on the Cray XC40 supercomputer Shaheen II. Samuel KORTAS, KAUST Supercomputing Laboratory. KSL Workshop Series, June 5th, 2016
2 Agenda
- A few tips when dealing with numerous jobs
- The Slurm way (up to a limit)
- Four KSL tools to move you further:
  - Breakit (1 to 10000s, all the same)
  - KTF (1 to 100, tuned)
  - Avati (1 to 1000s, programmed)
  - Decimate (dependent jobs)
- Hands-on session: /scratch/tmp/ksl_workshop
- Documentation on hpc.kaust.edu.sa/1001_jobs (to be completed today)
- Conclusion
3 Launching thousands of jobs
Some of our users:
- use Shaheen to explore parameter sweeps involving thousands of jobs that save thousands of temporary files
- need a result in a guaranteed time
- are not HPC experts, but pose challenging problems in terms of scheduling and file system stress
- implement complex workflows that send the output of one code into the input of others and produce a lot of small files
4 Scheduling thousands of jobs
KSL does its best, but it's not that easy, folks! The Tetris game gets rough with long rectangles ;-(
[Figure: jobs as rectangles over Time (x 1000s) and Nodes (6144 nodes available)]
5 Let's help the scheduler! (1/5) Setting the right elapsed time
6 Let's help the scheduler! (2/5) Let's share resources better among us
- The current scheduler policy is first in, first served
- Your priority increases as long as you are waiting 'actively' in the queue; held or dependent jobs are not counted
- Slurm takes your backfilling potential into account
- But we have to share, guys: the number of jobs in the queue is limited
- The Slurm fair-share implementation is reported to work well only with a small number of projects
7 Let's help the scheduler! (3/5) Let's lower the stress on the filesystem
- Each of the 1000s of jobs may need to read, probe or write a file
- We have a single filesystem shared by all the jobs, so let's spare it
- Lustre is not tuned for small files
- Let's use the ramdisk when possible and save only the data that matters to Lustre (see next slide)
- Let's communicate in memory instead of via files
- Let's choose the right stripe count
8 Let's help the scheduler! (4/5) How to use the ramdisk?
- On each Shaheen II compute node, /tmp is a ramdisk: a POSIX filesystem hosted directly in memory
- It starts at 64 GB and shrinks as your program uses more and more memory
- An additional memory request or a write in /tmp fails when: size(OS) + size(program instructions) + size(program variables) + size(/tmp) > 128 GB
- Still, /tmp is the fastest filesystem of all (compared to Lustre and DataWarp)
- But it is distributed and lost at the end of the job
- Think of storing temporary files in /tmp and saving them at the end of the job
- Think of storing frequently accessed files in /tmp
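The /tmp advice can be sketched in Python 3. This is a minimal illustration, not KSL code: the helper names `fits_in_node_memory` and `run_with_ramdisk`, and the way the scratch and destination directories are passed in, are assumptions made for the example. Only the 128 GB budget and the "work in /tmp, save at the end" pattern come from the slide.

```python
import os
import shutil
import tempfile

NODE_MEMORY_GB = 128  # per-node budget quoted on the slide


def fits_in_node_memory(os_gb, code_gb, data_gb, tmp_gb, limit_gb=NODE_MEMORY_GB):
    """Return True while OS + program + /tmp usage stays within the node memory."""
    return os_gb + code_gb + data_gb + tmp_gb <= limit_gb


def run_with_ramdisk(work, dest_dir, keep, ramdisk_root="/tmp"):
    """Run work(scratch_dir) in a private ramdisk directory, then save selected files.

    work     -- callable writing its temporary files into the directory it is given
    dest_dir -- permanent directory (e.g. on Lustre) for the files worth keeping
    keep     -- names of the files to copy back before the scratch area is deleted
    """
    scratch = tempfile.mkdtemp(dir=ramdisk_root)
    try:
        work(scratch)
        os.makedirs(dest_dir, exist_ok=True)
        saved = []
        for name in keep:
            src = os.path.join(scratch, name)
            if os.path.exists(src):
                shutil.copy2(src, os.path.join(dest_dir, name))
                saved.append(name)
        return saved
    finally:
        # The ramdisk content is lost at the end of the job anyway; free it early.
        shutil.rmtree(scratch, ignore_errors=True)
```

Only the files listed in `keep` reach the shared filesystem, so the thousands of small temporary files never touch Lustre.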
9 Let's help the scheduler! (5/5) Off-loading the cdls to compute nodes
You may need to:
- pre/postprocess
- monitor a job
- relaunch it
- get notified when it's starting or ending...
Automate all this and move the load from the cdl to the compute nodes:
- Use #SBATCH --mail-user
- Use breakit, ktf, maestro, decimate
- Ask the KSL team for help: it's only a script away
10 Managing 1001 jobs 1 - the SLURM way submitting Arrays...
11 Slurm Way (1/3)
- Slurm can submit and manage collections of similar jobs easily: job arrays
- To submit a 500-element job array: sbatch --array=1-500 -N1 -i my_in_%a -o my_out_%a job.sh
- where %a in a file name is mapped to the array task ID (1-500)
- squeue -r --user=<my_user_name> 'unfolds' jobs queued as a job array
More info at
12 Slurm Way (2/3)
- Job environment variables
- The squeue and scancel commands, plus some scontrol options, can operate on an entire job array or on selected task IDs
- The squeue -r option prints each task ID separately
13 Slurm Way (3/3) Job example
Possible commands:
- sbatch --array=1-16 my_job
- sbatch --array=1-500%20 my_job (only allows 20 active running jobs at a given time)
Taken from
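To make the array notation concrete, here is a small Python 3 sketch that decodes the simple `<first>-<last>[%<limit>]` form used on these slides and expands the `%a` file name pattern. The function names are hypothetical, and Slurm itself accepts richer specifications (comma-separated lists, steps) that this sketch deliberately ignores.

```python
def parse_array_spec(spec):
    """Split a simple '<first>-<last>[%<limit>]' array spec into (task_ids, limit)."""
    limit = None
    if "%" in spec:
        spec, limit_text = spec.split("%", 1)
        limit = int(limit_text)  # max number of tasks running at the same time
    first, last = (int(part) for part in spec.split("-", 1))
    return list(range(first, last + 1)), limit


def expand_pattern(pattern, task_id):
    """Replace %a by the array task ID, as done in the -i/-o file name patterns."""
    return pattern.replace("%a", str(task_id))
```

With `--array=1-500%20`, the first function yields the task IDs 1 to 500 and a limit of 20, and `expand_pattern("my_in_%a", 7)` gives `my_in_7`.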
14 Slurm Way But
- Slurm counts each job of the array as a job per se: for now, the total number of jobs in the queue is limited to 800 per user
- Pending jobs do not gain priority
- Only one parameter can vary: if you need to work on several parameters, the script itself has to deduce them from the array index...
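Deducing several parameters from the single array index can be sketched as follows in Python 3. `SLURM_ARRAY_TASK_ID` is the environment variable Slurm sets for each array task; the function name, the parameter names and the values of the sweep are hypothetical.

```python
import itertools
import os


def parameters_for_task(task_id, grid, first_id=1):
    """Return the parameter combination assigned to one array task.

    grid is a list of (name, values) pairs; combinations are numbered in the
    order produced by itertools.product, starting at first_id.
    """
    names = [name for name, _ in grid]
    combos = list(itertools.product(*(values for _, values in grid)))
    index = task_id - first_id
    if not 0 <= index < len(combos):
        raise IndexError("task %d is outside the %d combinations" % (task_id, len(combos)))
    return dict(zip(names, combos[index]))


if __name__ == "__main__":
    # Hypothetical sweep: 3 x 2 = 6 combinations, to be run with --array=1-6
    grid = [("nx", [64, 128, 256]), ("solver", ["cg", "gmres"])]
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
    print(parameters_for_task(task_id, grid))
```

Every task runs the same script and only the index differs, so the job array itself does not need to change when a parameter is added to the grid (only its range does).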
15 Slurm Way hands-on
- Submit the job /scratch/tmp/ksl_workshop/slurm/demo.job as an array of 20 occurrences
- Check the script, its output and the queue
- Cancel it
16 Slurm Way hands-on solution
- Submit the job /scratch/tmp/ksl_workshop/slurm/demo.job as an array of 20 occurrences: sbatch --array=1-20 /scratch/tmp/ksl_workshop/slurm/demo.job
- Check the script, its output and the queue: squeue -r --user=<my_user>
- Cancel it: scancel -n <my_job_name>
17 Managing 1001 jobs? 4 KSL open source Tools
18 Why? Ease your life and centralize some common developments
- breakit, ktf, maestro: available on Shaheen as modules
- decimate: under development for 2 PIs, to be released soon on bitbucket.org
- Soon available at (GNU GPL License)
- Written in Python 2.7
- Installed on Shaheen II, portable to workstations and Noor
- Our goal: hiding complexity
- All share a common API and an internal library engine, also available on bitbucket.org/kaust_ksl
- Maintained by KSL (samuel.kortas (at) kaust.edu.sa)
19 Managing 1001 jobs Using the breakit wrapper
20 Breakit (1/3) Idea and status
- To allow you to cope seamlessly with the limit of 800 jobs
- No need to change your job array
- Breakit automatically monitors the process for you
- Version 0.1: I need your feedback!
21-22 Slurm way: how to handle it with Slurm? [Diagram: you, or a program on the cdl, submitting against the max number of jobs in the queue]
23-29 Breakit (2/3) How does it work? [Diagram sequence, relative to the max number of jobs]
- breakit submits the first jobs
- Gone! Breakit is not active anymore
- The jobs are starting
- They submit the next jobs with a dependency
- The first ones are done, the dependency is solved, the next ones are pending
- They submit the next jobs with a dependency
30 Breakit (2/3) How does it work?
Instead of submitting all the jobs at once, they are submitted in chunks:
- Chunk #n is running or pending
- Chunk #n+1 depends on chunk #n: it starts only when every job of chunk #n has completed
- It then submits chunk #n+2, setting a dependency on chunk #n+1
We did offload some tasks from the cdl to the compute nodes ;-)
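The chunking idea can be sketched in Python 3 as below. This is not breakit's source code: the function names are hypothetical, and the choice of the `afterany` dependency type is an assumption (the slides only say "dependency"). In breakit the running jobs themselves submit the next chunk; the sketch only shows how the array is cut and which chunk waits for which.

```python
def chunk_ranges(n_jobs, chunk_size):
    """Split array indices 1..n_jobs into consecutive (first, last) chunks."""
    if n_jobs < 1 or chunk_size < 1:
        raise ValueError("n_jobs and chunk_size must be positive")
    return [(first, min(first + chunk_size - 1, n_jobs))
            for first in range(1, n_jobs + 1, chunk_size)]


def submission_plan(job_script, n_jobs, chunk_size):
    """Describe each chunk as an sbatch command line plus the chunk it waits for.

    The job ID of the previous chunk is only known once it has been submitted,
    so it appears here as the placeholder <jobid_of_chunk_k>.
    """
    plan = []
    for k, (first, last) in enumerate(chunk_ranges(n_jobs, chunk_size)):
        command = ["sbatch", "--array=%d-%d" % (first, last)]
        if k > 0:
            # Assumed dependency type: start once the previous chunk has ended.
            command.append("--dependency=afterany:<jobid_of_chunk_%d>" % (k - 1))
        command.append(job_script)
        plan.append(" ".join(command))
    return plan
```

For 100 jobs in chunks of 16 this gives seven chunks, the last one covering tasks 97 to 100, and never more than two chunks known to the queue at a time.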
31 Breakit (3/3) How to use it?
1) Load the breakit module:
   module load breakit
   man breakit (to be completed)
   breakit -h
2) Launch your job:
   breakit --job=your.job --array=<nb of jobs> --chunk=<max_nb_of_jobs_in_queue>
3) Manage it:
   squeue -r -u <user> -n <job_name>
   scancel -n <job_name>
32 Breakit Hands on
Via breakit, submit an array of 100 occurrences of the job /scratch/tmp/breakit/demo.job, with only 16 jobs in the queue simultaneously
33 Breakit Hands on (solution)
Via breakit, submit an array of 100 occurrences of the job /scratch/tmp/breakit/demo.job, with only 16 jobs in the queue simultaneously:
   module load breakit
   breakit --job=/scratch/tmp/breakit/demo.job --range=100 --chunk=16
34 Breakit Next steps
- Find a better name!
- Support all array ranges (not only 1-n)
- Provide an easy restart
- Provide an easier way to kill jobs
35 Managing 101 jobs Using KTF
36 KTF Idea
At a certain point, you may need:
- to evaluate the performance of a code under different conditions,
- to run a parametric study.
The same executable is run several times with a different set of parameters:
- physical values characterizing the problem,
- number of processors, threads and/or nodes,
- compiler used,
- compiling options,
- parameters passed on the srun command line to experiment with different placement strategies.
KTF (Kaust Test Framework) can help you with this!
37 What is KTF?
KTF (Kaust Test Framework) was designed and used during the Shaheen II procurement in order to ease the generation, submission, monitoring and result collection of a set of jobs depending on a set of parameters to explore.
- Written in Python 2.7
- Self-contained and portable
- Available on bitbucket.org/kaust_ksl/ktf
38 How does KTF work? A few definitions
- An 'experiment'
- A case is one single run of this experiment with a given set of parameters
- A test gathers a number of cases
39 How does KTF work? KTF relies on:
- a centralized file listing all the combinations of parameters to address, i.e. shaheen_cases.ktf
- a set of template files in which the parameters need to be replaced before submission: all files ending in .template
40 KTF hands-on! (1/) Initialize environment
1) Load the environment and check that ktf is available:
   module load ktf
   man ktf
   ktf -h
2) Create and initialize your working directory:
   mkdir <my_test_dir>
   cd <my_test_dir>
   ktf --init
   You should get a ktf-like tree structure with some examples of centralized case files and associated templates
3) Examine the case file shaheen_cases.ktf, understand the ktf syntax, modify parameters and check your changes by listing all the combinations:
   ktf --exp
41 KTF Centralized case file (see file shaheen_zephyr0.ktf)
[Figure: case file annotated with 'KTF comment', 'list of parameters' and 'third test case']
- # starts a comment, not parsed by KTF
- The first line gives the names of the parameters; Case and Experiment are absolutely mandatory
- Each following line is a test case, setting a value for EACH parameter
According to this case file, for the third test case, in each file ending in .template:
- Case will be replaced by 128
- Experiment will be replaced by zephyr/strong
- NX will be replaced by 255
- NY will be replaced by 255
- NB_CORES will be replaced by 128
- ELLAPSED_TIME will be replaced by 0:05:00
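The case-file and template mechanism can be illustrated with a short Python 3 sketch. It is not KTF's implementation: the whitespace-separated layout, the `__NAME__` spelling of the markers inside templates and the function names are all assumptions. The example row reuses the values quoted for the third test case.

```python
def parse_case_file(text):
    """Parse a whitespace-separated case table into a list of dicts.

    Lines starting with '#' are comments; the first remaining line names the
    parameters and every following line gives one value per parameter.
    """
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]
    header, cases = rows[0], rows[1:]
    for case in cases:
        if len(case) != len(header):
            raise ValueError("each case must set a value for EACH parameter")
    return [dict(zip(header, case)) for case in cases]


def fill_template(template, case, marker="__%s__"):
    """Replace every parameter marker (assumed spelling: __NAME__) by its value."""
    for name, value in case.items():
        template = template.replace(marker % name, value)
    return template


# Hypothetical case file and job template, for illustration only
CASES = """# a comment, not parsed
Case Experiment NX NY NB_CORES ELLAPSED_TIME
128 zephyr/strong 255 255 128 0:05:00
"""
TEMPLATE = "#SBATCH --time=__ELLAPSED_TIME__\nsrun -n __NB_CORES__ ./zephyr input"

if __name__ == "__main__":
    for case in parse_case_file(CASES):
        print(fill_template(TEMPLATE, case))
```

Keeping all combinations in one table and all substitutions in templates means a new case costs one extra line, not a new job script.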
42 KTF Directory initial structure [Figure: directory tree annotated with: subdirectory containing files common to all the experiments; one experiment directory (twice); default case file; ktf]
43 KTF job.shaheen.template (see files in tests/zephyr/strong/) [Figure: case file (KTF comment, list of parameters, third test case) next to the file job.shaheen.template; visible command: ./zephyr input]
44 KTF job.shaheen.template (see files in tests/zephyr/strong/) [Figure: case file (KTF comment, list of parameters, third test case) next to the file input.template]
45 KTF commands: ktf ...
--help : get help on the command line
--init : initialize the environment, copying example .template and .ktf files
--build : generate all combinations listed in the case file
--launch : generate all combinations listed in the case file and submit them
--exp : list all combinations present in the case .ktf file
--monitor : monitor all the experiments and display all results in a dashboard
--kill : kill all jobs related to this ktf session
--status : list all date stamps and cases of the experiments made or currently occurring
46 KTF hands-on! (2/) Prepare a first experiment
4) Examine the case file shaheen_cases.ktf, understand the ktf syntax, modify parameters and check your changes by listing all the combinations:
   ktf --exp
5) Build an experiment and check that the templated files have been processed correctly:
   ktf --build
   This should create one tests_ directory: tests_shaheen_<date>_<time>
47 KTF Directory after --build [Figure: directory tree showing the initial template and the third case]
48 KTF Directory after --launch
- zephyr is copied from the common directory
- job.shaheen is processed from job.shaheen.template
- input is processed from input.template
49 KTF Centralized case file: handling constant parameters
File shaheen_zephyr0.ktf (KTF comment, list of parameters, third test case) is strictly identical to file shaheen_zephyr1.ktf (list of parameters, plus a #KTF pragma declaring new parameters that will keep the same value ever after).
50 Another example: KTF case file [Figure: case file with Case and Experiment columns]
51 KTF filters and flags: ktf --xxx ...
--case-file=<case file> : use a case file other than shaheen_cases.ktf
--what=zzzz : filter on some cases
--reservation=<reservation name> : submit within a reservation
Examples:
   ktf --exp --what=128
   ktf --launch --what=64 --reservation=workshop
   ktf --exp --case-file=shaheen_zephyr1.ktf
52 KTF filters and flags: ktf --xxx ...
--ktf-file=<case file> : use a case file other than shaheen_cases.ktf
--what=zzzz : filter on some cases
--when=yyyy --today --now : filter on some date stamps
--times=<nb> : repeat the submission <nb> times
--info : switch on informative traces
--info-level=[ ] : change the informative trace level
--debug : switch on debugging traces
--debug-level=[ ] : change the debugging trace level
53 KTF hands-on! (3/) Playing with the what filter
4) Examine the case file shaheen_cases.ktf, understand the ktf syntax, modify parameters and check your changes by listing all the combinations, with or without filtering and using other case files:
   ktf --exp
   ktf --exp --what=<your filter>
   ktf --exp --case-file=shaheen_zephyr1.ktf
5) Build an experiment and check that the templated files have been processed correctly:
   ktf --build
   ktf --build --what=<your filter>
   This should create two tests directories in the directory from which you call ktf: tests_shaheen_<date>_<time>
54 KTF hands-on! (3/) Launch and monitor our first experiment
6) Build an experiment and submit it:
   ktf --launch [ --reservation=workshop ]
   This should create a new tests directory, ./tests_shaheen_<date>_<time>, and spawn the jobs
   ktf --monitor
   This will monitor your current ktf session; check what shows in the R/ directory
7) Play with repeating experiments and filtering results:
   ktf --launch --what=<your filter> [ --reservation=workshop ]
   ktf --launch --times=5 [ --reservation=workshop ]
   ktf --monitor
   ktf --monitor --what=<your case filter> --when=<your date filter>
   Check what shows in the R/ directory
55-56 KTF results dashboard: reading the result dashboard
% ktf --monitor
[Figure: dashboard annotated with: When; What; '!': job.err not empty; subdir in R/; Status; Time; Not yet finished]
57 KTF R/ directory: quick access to results
This R/ directory is updated each time you call ktf --monitor. It builds symbolic links to the results directory in order to give you quick access to the results you want to check.
58 KTF R/ directory: quick access to results directory [Figure]
59 KTF results configuration: implementation and default printing
In fact:
   alias ktf = python run_test.py
   alias ki = python run_test.py --init
   alias km = python run_test.py --monitor
The value to be displayed in the dashboard (printed when calling --monitor) is encoded in run_test.py. By default, it is <elapsed time taken by the whole test>/<status of the test>:
- with a '!' after the status if job.err is not empty
- with a '!' before the status if the job did not terminate properly
Remember, you can use cat, more or tail on R/*/job.err to scan all these files!
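The default dashboard value can be mimicked with a few lines of Python 3. This is a sketch of the formatting rule only, not the content of run_test.py: the function names, their arguments and the status strings used in the usage note are hypothetical.

```python
import os


def err_file_is_empty(path):
    """Treat a missing or zero-length job.err as 'no error output'."""
    return not os.path.exists(path) or os.path.getsize(path) == 0


def dashboard_cell(elapsed, status, err_is_empty=True, terminated_properly=True):
    """Format '<elapsed>/<status>' with the '!' warnings described on the slide."""
    before = "" if terminated_properly else "!"  # job did not terminate properly
    after = "" if err_is_empty else "!"          # job.err is not empty
    return "%s/%s%s%s" % (elapsed, before, status, after)
```

For instance, `dashboard_cell("0:04:12", "OK", err_is_empty=False)` returns `0:04:12/OK!`. Returning several values, such as `<flops>/<time>/<status>`, only requires changing the format string.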
60 KTF results configuration: changing default printing
But you can change the displayed values at will and adapt them to your own needs:
- Other values: flops, intermediate results, total number of iterations, convergence rate, ...
- Several values: <flops>/<time>/<status>
- Other events to trigger the '!' sign
- Other typographic signs
How to do it
61 KTF run_test.py file
62 KTF hands-on! (5/) Modifying the result printed
8) Check what ktf prints: ktf --monitor, and understand how run_test.py works
9) Modify run_test.py in order to print the time per iteration
63 KTF Next steps
- Gather tests into campaigns
- Have a better display for the --monitor option: web interface, automated generation of plots
- Enrich the filtering feature: regular expressions, several filters possible
- Enable coding capability inside the case file
- Complete the documentation
- Save results into a database and be able to compute statistics
- Cover the compiling step
64 KTF Next steps
- Support clean and campaigns
- Chain several jobs into one
- Support job arrays, dependencies, mail to user
- Port to Noor and workstations
- Offload from workstation to Shaheen
- Better versioning of the template file
- Provide one initial ktf environment per science field
65 Managing 1001 jobs using Maestro
66 Maestro principles (1/2)
Handling these studies should be the same on:
- a Linux box
- Shaheen, Noor, Stampede
- a laptop under Windows or Mac OS
- a given set of Linux boxes
The only prerequisite:
- Python > 2.4 and MPI on a supercomputer
- Python > 2.4 on a workstation
67 Maestro principles (2/2) Minimal or no knowledge of HPC environment required Easy management of the jobs handled as a whole.
68 A set of tools adapted to a distributed execution (1/3)
- No pre-installation needed on the machines: maestro is self-contained
- Easy and quick prototyping on a workstation, with immediate porting to a supercomputer
- Global error signals, easy to throw and trace
- Global handling of the jobs as a whole study (launching, monitoring, killing and restarting through one command)
69 A set of tools adapted to a distributed execution (2/3)
- All the flexibility of Python available to the user in a distributed environment (class inheritance, modules, ...)
- Production of robust, easy-to-read code, with an explicit error stack in case of a problem to debug
- Transparent replication of the environment on each of the compute nodes
- Work in /tmp of each compute node to minimize the stress on the filesystem
70 A set of tools adapted to a distributed execution (3/3)
- Extended grep (multi-line, multi-column, regular expressions) to postprocess the output files
- Centralized management of the templates to replace
- Global selection of the files to be kept and parametrization of the receiving directory
- A console to easily explore the subdirectories where results are saved
- Each running process can write to the same global file
71-74 Maestro Principles [Diagram sequence] Maestro allocates a pool of nodes and runs elementary jobs in it
75 An example [Figure: annotated script, with labels: file to save; name of the directory where results are saved; elementary computation; sending local and global messages; parametrized Z range; definition of the domain to sweep]
76 Command line options
<no option> : classical sequential run on 1 core, stopping at the first error encountered
--cores=<n> : parallel run on n cores
--depth=<p> : partial parallelisation up to level p
--stat : live status of the ongoing computation
--reservation=<id> : run inside a reservation
--time=hh:mm:ss : set the elapsed duration of the overall job
--kill : kill the ongoing computation and clean the environment
--resume : resume a computation
--restart : restart a computation from scratch
--help : help screen
77 Demo!
78 Next Steps
- Allowing maestro to launch multicore jobs
- More clever sweeping algorithms (decime project)
- Support of a given set of workstations
- Coupling maestro with a website
- Remote launching and dynamic off-loading from workstation to supercomputer
79 Managing dependent jobs in complex workflows Using Decimate
80 Idea
- Some workflows involve several steps depending on one another: several jobs with a dependency between them
- Some intermediate steps may break: the dependency will break and the workflow will remain idle, requesting an action
- We want to automate it
81 What is decimate? Add-ons and goodies
- A tool in Python, written for two different PIs with the same need
- Launch, monitor, heal dependent jobs
- Make things automated and smooth
82 What is decimate? Add-ons
- Centralized log files
- Global resume, --status and kill commands
- Sends a mail at any time to keep the user updated
- Can make a decision when a dependency is broken:
  - relaunch the same job again and fix the dependency
  - change the input data, relaunch and fix the dependency
  - cancel only this job and move on
  - cancel the whole workflow
83 Some examples of workflows
84 Conclusion
We have presented some useful tools to handle many jobs at a time:

               slurm          breakit        ktf        maestro    decimate
Jobs are       same           same           different  different  different
Parameters     1              1              several    many       any
#nodes/job     same           same           any        same       any
Dependent      one at a time  one at a time  no         no         yes

Typical # job: < 800 > ?
Your feedback is needed! help@hpc.kaust.edu.sa
Zheng Meyer-Zhao - zheng.meyer-zhao@surfsara.nl Consultant Clustercomputing Outline SURFsara About us What we do Cartesius and Lisa Architectures and Specifications File systems Funding Hands-on Logging
More informationIntroduction to Visualization on Stampede
Introduction to Visualization on Stampede Aaron Birkland Cornell CAC With contributions from TACC visualization training materials Parallel Computing on Stampede June 11, 2013 From data to Insight Data
More informationCRUK cluster practical sessions (SLURM) Part I processes & scripts
CRUK cluster practical sessions (SLURM) Part I processes & scripts login Log in to the head node, clust1-headnode, using ssh and your usual user name & password. SSH Secure Shell 3.2.9 (Build 283) Copyright
More informationIntroduction to the NCAR HPC Systems. 25 May 2018 Consulting Services Group Brian Vanderwende
Introduction to the NCAR HPC Systems 25 May 2018 Consulting Services Group Brian Vanderwende Topics to cover Overview of the NCAR cluster resources Basic tasks in the HPC environment Accessing pre-built
More informationUL HPC Monitoring in practice: why, what, how, where to look
C. Parisot UL HPC Monitoring in practice: why, what, how, where to look 1 / 22 What is HPC? Best Practices Getting Fast & Efficient UL HPC Monitoring in practice: why, what, how, where to look Clément
More informationExercises: Abel/Colossus and SLURM
Exercises: Abel/Colossus and SLURM November 08, 2016 Sabry Razick The Research Computing Services Group, USIT Topics Get access Running a simple job Job script Running a simple job -- qlogin Customize
More informationData storage on Triton: an introduction
Motivation Data storage on Triton: an introduction How storage is organized in Triton How to optimize IO Do's and Don'ts Exercises slide 1 of 33 Data storage: Motivation Program speed isn t just about
More informationUsing the Yale HPC Clusters
Using the Yale HPC Clusters Robert Bjornson Yale Center for Research Computing Yale University Feb 2017 What is the Yale Center for Research Computing? Independent center under the Provost s office Created
More information1 Bull, 2011 Bull Extreme Computing
1 Bull, 2011 Bull Extreme Computing Table of Contents Overview. Principal concepts. Architecture. Scheduler Policies. 2 Bull, 2011 Bull Extreme Computing SLURM Overview Ares, Gerardo, HPC Team Introduction
More informationECE 598 Advanced Operating Systems Lecture 22
ECE 598 Advanced Operating Systems Lecture 22 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 19 April 2016 Announcements Project update HW#9 posted, a bit late Midterm next Thursday
More informationIntroduction to Discovery.
Introduction to Discovery http://discovery.dartmouth.edu The Discovery Cluster 2 Agenda What is a cluster and why use it Overview of computer hardware in cluster Help Available to Discovery Users Logging
More informationDay 9: Introduction to CHTC
Day 9: Introduction to CHTC Suggested reading: Condor 7.7 Manual: http://www.cs.wisc.edu/condor/manual/v7.7/ Chapter 1: Overview Chapter 2: Users Manual (at most, 2.1 2.7) 1 Turn In Homework 2 Homework
More informationAASPI Software Structure
AASPI Software Structure Introduction The AASPI software comprises a rich collection of seismic attribute generation, data conditioning, and multiattribute machine-learning analysis tools constructed by
More informationCSC BioWeek 2016: Using Taito cluster for high throughput data analysis
CSC BioWeek 2016: Using Taito cluster for high throughput data analysis 4. 2. 2016 Running Jobs in CSC Servers A note on typography: Some command lines are too long to fit a line in printed form. These
More informationLAB. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers
LAB Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld dan/lars/bbarth/milfeld@tacc.utexas.edu XSEDE 12 July 16, 2012 1 Discovery
More informationExeco tutorial Grid 5000 school, Grenoble, January 2016
Execo tutorial Grid 5000 school, Grenoble, January 2016 Simon Delamare Matthieu Imbert Laurent Pouilloux INRIA/CNRS/LIP ENS-Lyon 03/02/2016 1/34 1 introduction 2 execo, core module 3 execo g5k, Grid 5000
More informationCSCS Proposal writing webinar Technical review. 12th April 2015 CSCS
CSCS Proposal writing webinar Technical review 12th April 2015 CSCS Agenda Tips for new applicants CSCS overview Allocation process Guidelines Basic concepts Performance tools Demo Q&A open discussion
More informationUsing Compute Canada. Masao Fujinaga Information Services and Technology University of Alberta
Using Compute Canada Masao Fujinaga Information Services and Technology University of Alberta Introduction to cedar batch system jobs are queued priority depends on allocation and past usage Cedar Nodes
More informationReFrame: A Regression Testing Framework Enabling Continuous Integration of Large HPC Systems
ReFrame: A Regression Testing Framework Enabling Continuous Integration of Large HPC Systems HPC Advisory Council 2018 Victor Holanda, Vasileios Karakasis, CSCS Apr. 11, 2018 ReFrame in a nutshell Regression
More informationUsing and Modifying the BSC Slurm Workload Simulator. Slurm User Group Meeting 2015 Stephen Trofinoff and Massimo Benini, CSCS September 16, 2015
Using and Modifying the BSC Slurm Workload Simulator Slurm User Group Meeting 2015 Stephen Trofinoff and Massimo Benini, CSCS September 16, 2015 Using and Modifying the BSC Slurm Workload Simulator The
More informationIntroduction to High Performance Computing at Case Western Reserve University. KSL Data Center
Introduction to High Performance Computing at Case Western Reserve University Research Computing and CyberInfrastructure team KSL Data Center Presenters Emily Dragowsky Daniel Balagué Guardia Hadrian Djohari
More informationCOSC 6374 Parallel Computation. Debugging MPI applications. Edgar Gabriel. Spring 2008
COSC 6374 Parallel Computation Debugging MPI applications Spring 2008 How to use a cluster A cluster usually consists of a front-end node and compute nodes Name of the front-end node: shark.cs.uh.edu You
More informationSubmitting and running jobs on PlaFRIM2 Redouane Bouchouirbat
Submitting and running jobs on PlaFRIM2 Redouane Bouchouirbat Summary 1. Submitting Jobs: Batch mode - Interactive mode 2. Partition 3. Jobs: Serial, Parallel 4. Using generic resources Gres : GPUs, MICs.
More informationXSEDE New User Training. Ritu Arora November 14, 2014
XSEDE New User Training Ritu Arora Email: rauta@tacc.utexas.edu November 14, 2014 1 Objectives Provide a brief overview of XSEDE Computational, Visualization and Storage Resources Extended Collaborative
More informationSlurm at UPPMAX. How to submit jobs with our queueing system. Jessica Nettelblad sysadmin at UPPMAX
Slurm at UPPMAX How to submit jobs with our queueing system Jessica Nettelblad sysadmin at UPPMAX Free! Watch! Futurama S2 Ep.4 Fry and the Slurm factory Simple Linux Utility for Resource Management Open
More informationIntroduction to GACRC Teaching Cluster PHYS8602
Introduction to GACRC Teaching Cluster PHYS8602 Georgia Advanced Computing Resource Center (GACRC) EITS/University of Georgia Zhuofei Hou zhuofei@uga.edu 1 Outline GACRC Overview Computing Resources Three
More informationBash for SLURM. Author: Wesley Schaal Pharmaceutical Bioinformatics, Uppsala University
Bash for SLURM Author: Wesley Schaal Pharmaceutical Bioinformatics, Uppsala University wesley.schaal@farmbio.uu.se Lab session: Pavlin Mitev (pavlin.mitev@kemi.uu.se) it i slides at http://uppmax.uu.se/support/courses
More informationPractical: a sample code
Practical: a sample code Alistair Hart Cray Exascale Research Initiative Europe 1 Aims The aim of this practical is to examine, compile and run a simple, pre-prepared OpenACC code The aims of this are:
More informationLinux Essentials. Smith, Roderick W. Table of Contents ISBN-13: Introduction xvii. Chapter 1 Selecting an Operating System 1
Linux Essentials Smith, Roderick W. ISBN-13: 9781118106792 Table of Contents Introduction xvii Chapter 1 Selecting an Operating System 1 What Is an OS? 1 What Is a Kernel? 1 What Else Identifies an OS?
More informationCSCI 447 Operating Systems Filip Jagodzinski
Filip Jagodzinski Announcements Homework 1 An extension of Lab 1 Big picture : for Homework 1 and 2, we ll focus on the lowlevel mechanics of the OS. Per the instructions, create a new branch in your gitlab
More informationGeorge Markomanolis IO500 Committee: John Bent, Julian M. Kunkel, Jay Lofstead 2017-11-12 http://www.io500.org IBM Spectrum Scale User Group, Denver, Colorado, USA Why? The increase of the studied domains,
More informationFor Dr Landau s PHYS8602 course
For Dr Landau s PHYS8602 course Shan-Ho Tsai (shtsai@uga.edu) Georgia Advanced Computing Resource Center - GACRC January 7, 2019 You will be given a student account on the GACRC s Teaching cluster. Your
More informationMonitoring and Trouble Shooting on BioHPC
Monitoring and Trouble Shooting on BioHPC [web] [email] portal.biohpc.swmed.edu biohpc-help@utsouthwestern.edu 1 Updated for 2017-03-15 Why Monitoring & Troubleshooting data code Monitoring jobs running
More informationOperating Systems 2 nd semester 2016/2017. Chapter 4: Threads
Operating Systems 2 nd semester 2016/2017 Chapter 4: Threads Mohamed B. Abubaker Palestine Technical College Deir El-Balah Note: Adapted from the resources of textbox Operating System Concepts, 9 th edition
More informationIntroduction to GACRC Teaching Cluster
Introduction to GACRC Teaching Cluster Georgia Advanced Computing Resource Center (GACRC) EITS/University of Georgia Zhuofei Hou zhuofei@uga.edu 1 Outline GACRC Overview Computing Resources Three Folders
More informationChapter 1: Distributed Information Systems
Chapter 1: Distributed Information Systems Contents - Chapter 1 Design of an information system Layers and tiers Bottom up design Top down design Architecture of an information system One tier Two tier
More informationXSEDE New User Tutorial
October 20, 2017 XSEDE New User Tutorial Jay Alameda National Center for Supercomputing Applications XSEDE Training Survey Please complete a short on line survey about this module at http://bit.ly/xsedesurvey.
More informationOperating Systems 2014 Assignment 2: Process Scheduling
Operating Systems 2014 Assignment 2: Process Scheduling Deadline: April 6, 2014, at 23:59. 1 Introduction Process scheduling is an important part of the operating system and has influence on the achieved
More informationChapter 4: Multithreaded Programming
Chapter 4: Multithreaded Programming Silberschatz, Galvin and Gagne 2013! Chapter 4: Multithreaded Programming Overview Multicore Programming Multithreading Models Threading Issues Operating System Examples
More informationA declarative programming style job submission filter.
A declarative programming style job submission filter. Douglas Jacobsen Computational Systems Group Lead NERSC -1- Slurm User Group 2018 NERSC Vital Statistics 860 projects 7750 users Edison NERSC-7 Cray
More informationAgenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2
Lecture 3: Processes Agenda Process Concept Process Scheduling Operations on Processes Interprocess Communication 3.2 Process in General 3.3 Process Concept Process is an active program in execution; process
More informationHow to Use a Supercomputer - A Boot Camp
How to Use a Supercomputer - A Boot Camp Shelley Knuth Peter Ruprecht shelley.knuth@colorado.edu peter.ruprecht@colorado.edu www.rc.colorado.edu Outline Today we will discuss: Who Research Computing is
More informationDavid R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.
Whitepaper Introduction A Library Based Approach to Threading for Performance David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.
More informationHTC Brief Instructions
HTC Brief Instructions Version 18.08.2018 University of Paderborn Paderborn Center for Parallel Computing Warburger Str. 100, D-33098 Paderborn http://pc2.uni-paderborn.de/ 2 HTC BRIEF INSTRUCTIONS Table
More informationHPC Workshop. Nov. 9, 2018 James Coyle, PhD Dir. Of High Perf. Computing
HPC Workshop Nov. 9, 2018 James Coyle, PhD Dir. Of High Perf. Computing NEEDED EQUIPMENT 1. Laptop with Secure Shell (ssh) for login A. Windows: download/install putty from https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
More informationXSEDE New User Tutorial
May 13, 2016 XSEDE New User Tutorial Jay Alameda National Center for Supercomputing Applications XSEDE Training Survey Please complete a short on-line survey about this module at http://bit.ly/hamptonxsede.
More informationECE 574 Cluster Computing Lecture 4
ECE 574 Cluster Computing Lecture 4 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 31 January 2017 Announcements Don t forget about homework #3 I ran HPCG benchmark on Haswell-EP
More informationIntroduction to Linux and Supercomputers
Introduction to Linux and Supercomputers Doug Crabill Senior Academic IT Specialist Department of Statistics Purdue University dgc@purdue.edu What you will learn How to log into a Linux Supercomputer Basics
More informationChapter 4: Threads. Chapter 4: Threads
Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples
More information