Good to Great: Choosing NetworkComputer over Slurm
NetworkComputer White Paper

2560 Mission College Blvd., Suite 130, Santa Clara, CA 95054
(408) 492-0940
Introduction

Are you considering Slurm as your job scheduler, or are you a current Slurm user wondering whether it is right for you because of issues you have encountered? If you are an administrator or user who cares about efficiency and reliability for high-volume workloads, it may be worthwhile to consider NetworkComputer, a more powerful, efficient, and reliable commercial job scheduler that is an industry standard for scalable high-performance computing. And if you are a Slurm user, you may want to know how Slurm compares to NetworkComputer before making a move. This white paper helps bridge that knowledge gap and shows that it is relatively easy to migrate from Slurm to NetworkComputer.

What is Slurm?

Slurm is a free workload manager that has been available since the early 2000s. It has known limitations in scaling and in meeting job-capacity needs, along with an inability to fully utilize all available computing resources, so its lack of robustness limits its applications in commercial markets. It also lacks monitoring capabilities, which is a major pitfall.

What is NetworkComputer?

NetworkComputer by Runtime is a commercial, enterprise-grade job scheduler. It shares some basic capabilities with Slurm but offers much more practical value to the end user for everyday professional use. As a commercial scheduler used by top companies around the world, it is many times more scalable in capacity and performance, and it is much easier to use. As the industry's fastest job scheduler, NetworkComputer is built to be lightweight and simple, so it can also be deployed as a private scheduler for a single person, a group, or a project. If your productivity concerns include achieving the most efficient utilization of your expensive licenses and hardware resources, NetworkComputer will best fit your needs.

Comparing NetworkComputer vs. Slurm Terminology

In Slurm, the central component is called slurmctld.
It manages the workload and all scheduling. Each computer (referred to as a "node" by Slurm) runs a daemon called slurmd, which performs some analysis of the computer it is running on and then accepts jobs sent from slurmctld. The configuration for a Slurm cluster is typically kept in a single file, usually found at /etc/slurm-llnl/slurm.conf.
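For illustration, a minimal slurm.conf fragment might look like the following. This is a hypothetical sketch (cluster name, host names, and sizes are invented), showing how nodes are declared statically in the file rather than detected elastically:

```
# Hypothetical slurm.conf fragment: nodes are declared statically,
# so adding a machine means editing this file and reconfiguring.
ClusterName=demo
ControlMachine=node3
NodeName=node[1-3] CPUs=4 RealMemory=8000 State=UNKNOWN
PartitionName=debug Nodes=node[1-3] Default=YES MaxTime=INFINITE State=UP
```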
In NetworkComputer, the central component is called vovserver, and the daemon running on each remote computer is called vovslave. The configuration is spread over several files, all contained in a directory called "vnc.swd" (pronounced "swid"), the "Server Working Directory". NetworkComputer has one file to describe the list of slaves and another to describe the list of resources, such as licenses and limits. Everything is "elastic" in the sense that slaves and resources can be added at any time, and the characteristics of a vovslave can be modified at will. This flexibility is important and a key reason why NetworkComputer is used for commercial purposes. By default, vovslaves automatically detect all characteristics of the machine they are running on, including RAM and cores. Compare this to Slurm, which is non-elastic in behavior.

  NetworkComputer              Slurm                        Description
  vovserver                    slurmctld                    The hub of the system; manager of workload and scheduling
  vovslave                     slurmd                       Agent to execute jobs on a "node"
  .../vnc.swd/slaves.tcl       /etc/slurm-llnl/slurm.conf   Configuration files
  .../vnc.swd/resources.tcl

Comparing NetworkComputer vs. Slurm Commands

Slurm's command-line interface consists of a few commands such as sbatch, scancel, squeue, sinfo, scontrol, and smap. NetworkComputer's command-line interaction is based on two commands: ncmgr, used by the manager to start and stop the system, and nc <command>, used by all users. Here is the usage message from nc:
% nc
nc: Usage Message

Usage: nc [-q queuename] <command> [command options]

Queue selection:
  The default queue is called "vnc". You can specify a different queue
  with the option -q <queuename> or by setting the environment
  variable NC_QUEUE.

Commands:
  clean          Cleanup log files and env files.
  debug          Show how to run the same job without NetworkComputer.
  dispatch       Force dispatch of a job to a specific slave.
  forget         Forget old jobs from the system.
  getfield       Get a field for a job.
  gui            Start a simple graphical interface.
  help           This help message.
  hosts          Show farm hosts (also called slaves).
  info           Get information about a job and its outputs.
  list           List the jobs in the system.
  jobclass       List the available job classes.
  kerberos       Interface to Kerberos (experimental).
  modify         Modify attributes of scheduled jobs.
  monitor        Monitor network activity.
  rerun          Rerun a job already known to the system.
  resources      Shows resource list and current statistics.
  resume         Resume a job previously suspended.
  run <job>      Run a new job (also called 'submit').
  preempt        Preempt a job.
  slavelist      Show available slave lists.
  stop           Stop jobs.
  submit <job>   Same as 'run'.
  summary        Get a summary report for all my jobs.
  suspend        Suspend the execution of a job.
  wait           Wait for a job to complete.
  why            Analyze job status reasons.

Unique abbreviations for commands are accepted.

Advanced features:
  cmd <command>      Execute an arbitrary VOV command in the context
                     of the NetworkComputer server.
  source <file.tcl>  Source the given Tcl file.
  -                  Accept commands from stdin.

For more help type: % nc <command> -h
Copyright (c) Runtime Design Automation.

In Slurm, you need to write a script to submit a command, whereas NetworkComputer allows the direct submission of any command. For example, to submit the command "sleep 0" to Slurm, a script like this must be used:
#!/bin/csh -f
# This is my script, called ./sleep0.csh
sleep 0

  NetworkComputer                      Slurm                            Description
  nc run [OPTIONS] ./myscript.csh      sbatch [OPTIONS] ./myscript.csh  Methods to submit batch jobs
  nc run [OPTIONS] sleep 0
  nc stop ...                          scancel ...                      De-schedule submitted jobs; stop them if they are running
  nc list                              squeue                           List the jobs in the system
  nc info JOBID                        scontrol show job JOBID          Detailed information about one job
  nc getfield JOBID
  nc wait JOBID                                                         Wait for the specified job to be done
  nc gui &                             smap                             Graphical visualization of jobs
  nc hosts, nc resources, nc cmd vsi   sinfo                            Various commands to show information about the system
  nc hosts, nc monitor                 sinfo -N                         List information about machines connected to the scheduler

Jobs Visualization and Interactive Queries in NetworkComputer

In Slurm, there is no comprehensive facility to visualize your job status or to drill down easily, point-and-click, for debugging. In NetworkComputer, you get an interactive GUI where you can visualize the status of all scheduled jobs and drill down into any job for real-time details. Much other information, such as workload and resource details, is also available.

Figure 1: The NetworkComputer GUI shows jobs as colored boxes. Green jobs are done, red jobs have failed, orange jobs are currently running, and cyan jobs are waiting for resources to become available.
Figure 2: In NetworkComputer, you can customize your view so that each box in the GUI shows the specific job details that matter to you. You can also easily drill down to get more job details.

Figure 3: NetworkComputer gives users views of workload and resources.

Comparing NetworkComputer Performance vs. Slurm for Light Workloads
In this example, the Slurm cluster consists of three identical desktops, called node1, node2, and node3, with the master running on node3. The NetworkComputer setup uses the same hardware, with the server running on node2. With a light load, the difference between Slurm and NetworkComputer is negligible.

In Slurm:

% sbatch ./sleep0.csh
Submitted batch job
% scontrol show job
   JobId= Name=sleep0.csh
   UserId=joe(1024) GroupId=joe(1002)
   Priority= Account=(null) QOS=(null)
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime= T11:11:42 EligibleTime= T11:11:42
   StartTime= T11:11:42 EndTime= T11:11:42
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=debug AllocNode:Sid=node2:22587
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node2 BatchHost=node2
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/joe/tmp/./sleep0.csh
   WorkDir=/home/joe/tmp

In NetworkComputer:

% nc run sleep 0
Fairshare = /time/users
Resources = linux64
Env       = SNAPSHOT(vnc_logs/snapshots/joe/linux64/env26253.env)
Command   = vw sleep 0
Logfile   = vnc_logs/ /
JobURL    =
JobId     =

% nc info
Id,User,Group    ,joe.joe,/time/users.joe
Environment      SNAPSHOT(vnc_logs/snapshots/joe/linux64/env26253.env)
Directory        /home/joe
Command          sleep 0
Resources        linux64
Submitted from   node2
Submitted at     Wed Jan 18 11:11:16 PST 2017
Priorities       schedule=normal execution=low
PlacementPolicy  fastest,pack
Status           Done
Host             localhost
Slave            localhost
QueueTime        0s
CPUTime          0.01
MaxRAM           0MB
Duration         0
Age              1m31s
AutoForget       1
Job is Done
Main Reason: This job successfully executed.

To simplify automation, NetworkComputer helps the developer in simple but effective ways, such as:

- The option -v 1 of nc run, which returns only the ID of the submitted job
- The command nc getfield, which gives direct access to one or more fields of a job without requiring any grep/awk work

NetworkComputer:

% set id = `nc run -v 1 sleep 0`
% nc wait $id
% nc getfield $id status
VALID

Slurm has no equivalent.

NetworkComputer Outperforms Slurm for Normal to Heavy Workloads

This is the major reason why Slurm is not fit for commercial needs: it cannot handle heavy loads. In fact, it struggles even with less-than-heavy loads, as the next example shows. In NetworkComputer, a constant load of 100,000 or more jobs in the queue is considered ordinary, while it chokes Slurm. A million jobs in the queue is a heavy load, easily handled by NetworkComputer.

Assume we have a workload of 80,000 jobs. In Slurm, you may want to submit the jobs with an array. The maximum array size in our default installation appears to be 1000 elements, so we need to submit 80 arrays. Our Slurm installation stops accepting jobs while fewer than 10,000 jobs are in the queue, which is a serious limitation, while NetworkComputer easily accepts the whole workload in about 6 seconds.
NetworkComputer:

% time repeat 80 nc run -v 0 -array 1000 sleep 0
... omitting some output from 'time' ...
0.052u 0.004s 0+0k 0+8io 0pf+0w
0.043u 0.008s 0+0k 0+8io 0pf+0w
0.051u 0.004s 0+0k 0+8io 0pf+0w
0.052u 0.000s 0+0k 0+8io 0pf+0w
3.974u 0.462s 0+0k 0+640io 0pf+0w

% nc summary
NC Summary For Set System:User:joe
TOTAL JOBS   80,001   Duration: 3m15s
  Done          690
  Queued     79,309
  Running         2

Slurm:

% repeat 80 sbatch --array= ./sleep0.csh
Submitted batch job
Submitted batch job
Submitted batch job
...
sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.

In a similar experiment at much larger scale, NetworkComputer easily handles 880,000 jobs. To check the status of the workload, we can use the "summary" report, which is efficient, compact, and easy to understand (Slurm has no equivalent function):

% nc summary -a -b
NC Summary For Set System:jobs
TOTAL JOBS  101,821   Duration: 37m40s
  Done       26,677
  Queued     75,138
  Running         4
BKT  JOBS  PRI  AGE  GROUP            USER  TOOL      WAITING FOR
...             s    /time/users.joe  joe   hostname  HW linux64
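As noted earlier, Slurm has no counterpart to nc getfield, so extracting a single field from scontrol output takes grep/awk-style text processing. A minimal sketch of that work, with a sample output line hard-coded for illustration (in real use you would pipe in `scontrol show job $JOBID` instead):

```shell
#!/bin/sh
# Sketch: pull one field out of 'scontrol show job'-style key=value output.
out='JobId=1234 Name=sleep0.csh JobState=COMPLETED ExitCode=0:0'
# Split on spaces, then select the value of the JobState key.
echo "$out" | tr ' ' '\n' | awk -F= '$1 == "JobState" { print $2 }'
# prints COMPLETED
```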
Comparing NetworkComputer vs. Slurm Scheduler Status

In Slurm, to get the scheduler status, you can execute scontrol ping and get a very simple report:

% scontrol ping
Slurmctld(primary/backup) at node3/(null) are UP/DOWN

In NetworkComputer, a common method to check status is "nc cmd vsi" (vsi stands for vov-server-info), which reports much more meaningful information:

% nc cmd vsi
Vov Server Information - 01/10 vnchq@node3:6271
URL:
Jobs:     101,892   Workload:
Files:    101,904   - running:        5
Sets:          22   - queued:    59,242
Retraces:       0   - done:      42,572
                    - failed:
Slaves:         2   Buckets:          1
- busy:         1   Duration:        0s
- full:         1   SchedulerTime: 0.00s
Slots:              TotalResources:  14
Pid:          825   Saved: 1h29m ago
Size:          MB   TimeTolerance:   3s

Recent jobs for user joe:
Done     vw hostname > vnc_logs/ /
Done     vw hostname > vnc_logs/ /
Done     vw hostname > vnc_logs/ /
Running  vw hostname > vnc_logs/ /
Running  vw hostname > vnc_logs/ /
Running  vw hostname > vnc_logs/ /

Comparing NetworkComputer vs. Slurm Suspension Capabilities

In Slurm, you can suspend and resume a job only if you are root or the admin user. This is a serious limitation. For example, if we try to suspend our job, we get:

% scontrol suspend
slurm_suspend error: Access/permission denied
% scontrol resume
slurm_suspend error: Access/permission denied

In NetworkComputer, the owner of a job can suspend it and resume it, a basic capability for any practical usage. This capability can also be given to any user who has ADMIN privileges.

% nc suspend
vnc 02/20 message: Suspending job
% nc suspend
vnc 02/20 message: No need to suspend: it is suspended.
% nc resume
vnc 02/20 message: Resuming job
% nc resume
vnc 02/20 message: No need to resume: it is running.

Another capability is preempting a job, with nc preempt:

% nc preempt

In this case, the job is suspended, all resources associated with the job (including licenses and CPUs) are freed, and those resources are made available to other "more important" jobs in the queue. If no such job exists, the preempted job is automatically resumed. Slurm has a similar but less full-featured preemption capability.

Comparing How NetworkComputer vs. Slurm Handle Dependencies

In Slurm, to execute a job after another one has completed, we can say:

% sbatch --dependency=afterok: ./mysleep.csh

In NetworkComputer, we have a dependency similar to "afterok":

% set j1 = `nc run -v 1 sleep 10`
% nc run -dep $j1 sleep 2

In addition, NetworkComputer has a key advantage: a simple way of waiting for a job to complete with nc wait, which does not exist in Slurm:

% set j1 = `nc run -v 1 sleep 10`
% nc wait $j1

If we want to run one job at a time, in Slurm we can use the "singleton" dependency, while in NetworkComputer we can use the "-limit 1" option of "nc run":
NetworkComputer:
% nc run -limit 1 -array 1000 sleep 0

Slurm:
% sbatch -J myname --array= ./mysleep0.csh

Comparing How NetworkComputer vs. Slurm Manage Software Licenses

In Slurm, licenses can be represented by "Licenses" lines in the slurm.conf file:

# Fragment of slurm.conf
Licenses=verilog:3,spice:2

In NetworkComputer, licenses are sampled automatically, typically every 30 seconds, by the LicenseMonitor subsystem, which immediately updates the scheduler. This allows the automatic tracking and management of all features serviced by FLEXlm or any other license daemon. NetworkComputer typically handles many hundreds of such licenses. For commercial purposes, this is a much more robust system.

Comparing NetworkComputer vs. Slurm Architecture

In Slurm, the list of current jobs (fewer than 40k jobs) is held in the directory /var/lib/slurm-llnl/slurmctld on the master node. Each job is a sub-directory which contains:

- A copy of the submission script
- A snapshot of the submission environment

In NetworkComputer, all job information is kept efficiently in memory. Here is a snapshot of the two daemons running on the same machine, each after running about 400,000 jobs:

NetworkComputer:
ncadmin   ?  S   Feb17   1:56  vovserver -p nc

Slurm:
slurm     ?  Sl  Feb17  68:05  /usr/sbin/slurmctld
Note that the NetworkComputer vovserver memory footprint is less than half the size of the slurmctld memory footprint, even though it holds all 400k jobs in memory. NetworkComputer's memory management is thus far superior to Slurm's.

So, You Want to Use NetworkComputer with Slurm?

Yes, you can get the capacity and ease-of-use benefits of NetworkComputer while using Slurm as the main allocator of computing resources. In situations where you need to retain Slurm for whatever reason, NetworkComputer can easily piggyback on it. This is like having your own private scheduler for your workload without violating the rules of your organization. A sample method to test-drive NetworkComputer using computing resources from your existing Slurm installation:

1. Install NetworkComputer on a shared file system (example: /remote/sw/runtime/).

2. Set up your shell by sourcing one of the setup scripts found in the installation directory (example: /remote/sw/runtime/common/etc/vovrc.{sh,csh}).

3. Start your private scheduler:

   % ncmgr start -dir . -queue my_vnc ...
   % setenv NC_QUEUE my_vnc

4. Create the following script, which starts a transient vovslave on the current host:

   % cat ncslave.csh
   #!/bin/csh -f
   # Start a slave with 1 slot, max load 100, for no more than 2 hours
   vovslaveroot -T 1 -M 100 -a "@HOST@_@PID@" -z 1m -Z 2h

5. Request computing resources from Slurm:

   % vovproject enable my_vnc
   % sbatch ./ncslave.csh
   % sbatch --array=1-50 ./ncslave.csh

Now you can submit jobs to your NetworkComputer instance and use resources from Slurm. If you are the network administrator, you can someday consider moving the entire management of your clusters to NetworkComputer.
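Once the transient vovslaves started by Slurm have checked in, the private queue behaves like any other NetworkComputer instance. A hypothetical session (job output elided) might look like this, using only the nc commands introduced above:

```
% setenv NC_QUEUE my_vnc
% nc run sleep 0
% nc summary
% nc hosts
```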
Summary

Although Slurm is free, it has major limitations in scalability and usability that prevent it from being a dependable solution for commercial applications. That is why you won't find it used in commercial settings with serious reliability needs. Able to deal with only lighter workloads, it lacks the capacity needed for everyday demands. In addition, its user interface is raw and lacks the user-friendly functions that give users proper visibility into their jobs.

With NetworkComputer, you get a robust, enterprise-grade job scheduler that scales with all workload types, delivering high performance and capacity as well as GUIs that give users maximum visibility into their jobs. You also receive enterprise-level service, so you know you have full customer support for your mission-critical needs. Today, NetworkComputer is the job scheduler of choice for major Fortune companies for these reasons. To get started with NetworkComputer, visit the Runtime Design Automation website and sign up.
More informationHPC Introductory Course - Exercises
HPC Introductory Course - Exercises The exercises in the following sections will guide you understand and become more familiar with how to use the Balena HPC service. Lines which start with $ are commands
More informationIntroduction to SLURM & SLURM batch scripts
Introduction to SLURM & SLURM batch scripts Anita Orendt Assistant Director Research Consulting & Faculty Engagement anita.orendt@utah.edu 23 June 2016 Overview of Talk Basic SLURM commands SLURM batch
More informationHTCondor Essentials. Index
HTCondor Essentials 31.10.2017 Index Login How to submit a job in the HTCondor pool Why the -name option? Submitting a job Checking status of submitted jobs Getting id and other info about a job
More informationIntroduction to SLURM on the High Performance Cluster at the Center for Computational Research
Introduction to SLURM on the High Performance Cluster at the Center for Computational Research Cynthia Cornelius Center for Computational Research University at Buffalo, SUNY 701 Ellicott St Buffalo, NY
More informationBright Cluster Manager: Using the NVIDIA NGC Deep Learning Containers
Bright Cluster Manager: Using the NVIDIA NGC Deep Learning Containers Technical White Paper Table of Contents Pre-requisites...1 Setup...2 Run PyTorch in Kubernetes...3 Run PyTorch in Singularity...4 Run
More informationSlurm Version Overview
Slurm Version 18.08 Overview Brian Christiansen SchedMD Slurm User Group Meeting 2018 Schedule Previous major release was 17.11 (November 2017) Latest major release 18.08 (August 2018) Next major release
More informationHighly Available Forms and Reports Applications with Oracle Fail Safe 3.0
Highly Available Forms and Reports Applications with Oracle Fail Safe 3.0 High Availability for Windows NT An Oracle Technical White Paper Robert Cheng Oracle New England Development Center System Products
More informationUsing Docker in High Performance Computing in OpenPOWER Environment
Using Docker in High Performance Computing in OpenPOWER Environment Zhaohui Ding, Senior Product Architect Sam Sanjabi, Advisory Software Engineer IBM Platform Computing #OpenPOWERSummit Join the conversation
More informationWorkload management at KEK/CRC -- status and plan
Workload management at KEK/CRC -- status and plan KEK/CRC Hiroyuki Matsunaga Most of the slides are prepared by Koichi Murakami and Go Iwai CPU in KEKCC Work server & Batch server Xeon 5670 (2.93 GHz /
More informationAnnouncement. Exercise #2 will be out today. Due date is next Monday
Announcement Exercise #2 will be out today Due date is next Monday Major OS Developments 2 Evolution of Operating Systems Generations include: Serial Processing Simple Batch Systems Multiprogrammed Batch
More informationIntroduction to RCC. September 14, 2016 Research Computing Center
Introduction to HPC @ RCC September 14, 2016 Research Computing Center What is HPC High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers
More informationIntroduction to BioHPC
Introduction to BioHPC New User Training [web] [email] portal.biohpc.swmed.edu biohpc-help@utsouthwestern.edu 1 Updated for 2015-06-03 Overview Today we re going to cover: What is BioHPC? How do I access
More informationIntroduction to the Cluster
Follow us on Twitter for important news and updates: @ACCREVandy Introduction to the Cluster Advanced Computing Center for Research and Education http://www.accre.vanderbilt.edu The Cluster We will be
More informationIntroduction to RCC. January 18, 2017 Research Computing Center
Introduction to HPC @ RCC January 18, 2017 Research Computing Center What is HPC High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers much
More informationCRUK cluster practical sessions (SLURM) Part I processes & scripts
CRUK cluster practical sessions (SLURM) Part I processes & scripts login Log in to the head node, clust1-headnode, using ssh and your usual user name & password. SSH Secure Shell 3.2.9 (Build 283) Copyright
More informationBright Cluster Manager
Bright Cluster Manager Using Slurm for Data Aware Scheduling in the Cloud Martijn de Vries CTO About Bright Computing Bright Computing 1. Develops and supports Bright Cluster Manager for HPC systems, server
More informationDay 9: Introduction to CHTC
Day 9: Introduction to CHTC Suggested reading: Condor 7.7 Manual: http://www.cs.wisc.edu/condor/manual/v7.7/ Chapter 1: Overview Chapter 2: Users Manual (at most, 2.1 2.7) 1 Turn In Homework 2 Homework
More informationDuke Compute Cluster Workshop. 10/04/2018 Tom Milledge rc.duke.edu
Duke Compute Cluster Workshop 10/04/2018 Tom Milledge rc.duke.edu rescomputing@duke.edu Outline of talk Overview of Research Computing resources Duke Compute Cluster overview Running interactive and batch
More informationJob Management System Extension To Support SLAAC-1V Reconfigurable Hardware
Job Management System Extension To Support SLAAC-1V Reconfigurable Hardware Mohamed Taher 1, Kris Gaj 2, Tarek El-Ghazawi 1, and Nikitas Alexandridis 1 1 The George Washington University 2 George Mason
More informationMoab Passthrough. Administrator Guide February 2018
Moab Passthrough Administrator Guide 9.1.2 February 2018 2018 Adaptive Computing Enterprises, Inc. All rights reserved. Distribution of this document for commercial purposes in either hard or soft copy
More informationGet your own Galaxy within minutes
Get your own Galaxy within minutes Enis Afgan, Nitesh Turaga, Nuwan Goonasekera GCC 2016 Bloomington, IN Access slides from bit.ly/gcc2016_usecloud Today s agenda Introduction Hands on, part 1 Launch your
More informationPBS Pro Documentation
Introduction Most jobs will require greater resources than are available on individual nodes. All jobs must be scheduled via the batch job system. The batch job system in use is PBS Pro. Jobs are submitted
More informationCycleServer Grid Engine Support Install Guide. version
CycleServer Grid Engine Support Install Guide version 1.34.4 Contents CycleServer Grid Engine Guide 1 Administration 1 Requirements 1 Installation 1 Monitoring Additional Grid Engine Clusters 3 Monitoring
More informationCS 471 Operating Systems. Yue Cheng. George Mason University Fall 2017
CS 471 Operating Systems Yue Cheng George Mason University Fall 2017 Outline o Process concept o Process creation o Process states and scheduling o Preemption and context switch o Inter-process communication
More informationTroubleshooting Jobs on Odyssey
Troubleshooting Jobs on Odyssey Paul Edmon, PhD ITC Research CompuGng Associate Bob Freeman, PhD Research & EducaGon Facilitator XSEDE Campus Champion Goals Tackle PEND, FAIL, and slow performance issues
More informationAndrej Filipčič
Singularity@SiGNET Andrej Filipčič SiGNET 4.5k cores, 3PB storage, 4.8.17 kernel on WNs and Gentoo host OS 2 ARC-CEs with 700TB cephfs ARC cache and 3 data delivery nodes for input/output file staging
More informationInstalling and Configuring VMware Identity Manager Connector (Windows) OCT 2018 VMware Identity Manager VMware Identity Manager 3.
Installing and Configuring VMware Identity Manager Connector 2018.8.1.0 (Windows) OCT 2018 VMware Identity Manager VMware Identity Manager 3.3 You can find the most up-to-date technical documentation on
More informationRHRK-Seminar. High Performance Computing with the Cluster Elwetritsch - II. Course instructor : Dr. Josef Schüle, RHRK
RHRK-Seminar High Performance Computing with the Cluster Elwetritsch - II Course instructor : Dr. Josef Schüle, RHRK Overview Course I Login to cluster SSH RDP / NX Desktop Environments GNOME (default)
More informationChapter 8. Operating System Support. Yonsei University
Chapter 8 Operating System Support Contents Operating System Overview Scheduling Memory Management Pentium II and PowerPC Memory Management 8-2 OS Objectives & Functions OS is a program that Manages the
More information07 - Processes and Jobs
07 - Processes and Jobs CS 2043: Unix Tools and Scripting, Spring 2016 [1] Stephen McDowell February 10th, 2016 Cornell University Table of contents 1. Processes Overview 2. Modifying Processes 3. Jobs
More informationJob Management on LONI and LSU HPC clusters
Job Management on LONI and LSU HPC clusters Le Yan HPC Consultant User Services @ LONI Outline Overview Batch queuing system Job queues on LONI clusters Basic commands The Cluster Environment Multiple
More informationDirections in Workload Management
Directions in Workload Management Alex Sanchez and Morris Jette SchedMD LLC HPC Knowledge Meeting 2016 Areas of Focus Scalability Large Node and Core Counts Power Management Failure Management Federated
More informationSlurm Workload Manager Introductory User Training
Slurm Workload Manager Introductory User Training David Bigagli david@schedmd.com SchedMD LLC Outline Roles of resource manager and job scheduler Slurm design and architecture Submitting and running jobs
More informationIRIX Resource Management Plans & Status
IRIX Resource Management Plans & Status Dan Higgins Engineering Manager, Resource Management Team, SGI E-mail: djh@sgi.com CUG Minneapolis, May 1999 Abstract This paper will detail what work has been done
More informationOVERVIEW OF THE SAS GRID
OVERVIEW OF THE SAS GRID Host Caroline Scottow Presenter Peter Hobart MANAGING THE WEBINAR In Listen Mode Control bar opened with the white arrow in the orange box Copyr i g ht 2012, SAS Ins titut e Inc.
More informationHPC Introductory Training. on Balena by Team Bath
HPC Introductory Training on Balena by Team HPC @ Bath What is HPC and why is it different to using your desktop? High Performance Computing most generally refers to the practice of aggregating computing
More informationLAB. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers
LAB Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld dan/lars/bbarth/milfeld@tacc.utexas.edu XSEDE 12 July 16, 2012 1 Discovery
More informationTesting SLURM open source batch system for a Tierl/Tier2 HEP computing facility
Journal of Physics: Conference Series OPEN ACCESS Testing SLURM open source batch system for a Tierl/Tier2 HEP computing facility Recent citations - A new Self-Adaptive dispatching System for local clusters
More informationand how to use TORQUE & Maui Piero Calucci
Queue and how to use & Maui Scuola Internazionale Superiore di Studi Avanzati Trieste November 2008 Advanced School in High Performance and Grid Computing Outline 1 We Are Trying to Solve 2 Using the Manager
More informationCOPYRIGHTED MATERIAL. Introducing VMware Infrastructure 3. Chapter 1
Mccain c01.tex V3-04/16/2008 5:22am Page 1 Chapter 1 Introducing VMware Infrastructure 3 VMware Infrastructure 3 (VI3) is the most widely used virtualization platform available today. The lineup of products
More informationQueue systems. and how to use Torque/Maui. Piero Calucci. Scuola Internazionale Superiore di Studi Avanzati Trieste
Queue systems and how to use Torque/Maui Piero Calucci Scuola Internazionale Superiore di Studi Avanzati Trieste March 9th 2007 Advanced School in High Performance Computing Tools for e-science Outline
More informationHosts & Partitions. Slurm Training 15. Jordi Blasco & Alfred Gil (HPCNow!)
Slurm Training 15 Agenda 1 2 Compute Hosts State of the node FrontEnd Hosts FrontEnd Hosts Control Machine Define Partitions Job Preemption 3 4 Define Limits Define ACLs Shared resources Partition States
More informationDesign and deliver cloud-based apps and data for flexible, on-demand IT
White Paper Design and deliver cloud-based apps and data for flexible, on-demand IT Design and deliver cloud-based apps and data for flexible, on-demand IT Discover the fastest and easiest way for IT to
More informationAn introduction to checkpointing. for scientific applications
damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI An introduction to checkpointing for scientific applications November 2013 CISM/CÉCI training session What is checkpointing? Without checkpointing: $./count
More informationSlurm Support for Linux Control Groups
Slurm Support for Linux Control Groups Slurm User Group 2010, Paris, France, Oct 5 th 2010 Martin Perry Bull Information Systems Phoenix, Arizona martin.perry@bull.com cgroups Concepts Control Groups (cgroups)
More informationIntroduction to Operating Systems (Part II)
Introduction to Operating Systems (Part II) Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Introduction 1393/6/24 1 / 45 Computer
More informationHPC Introductory Training. on Balena by Team Bath
HPC Introductory Training on Balena by Team HPC @ Bath Housekeeping Attendance sheet Fire alarm Refreshment breaks Questions anytime lets us know if you need any assistance. Feedback at the end of the
More informationUsing Cartesius and Lisa. Zheng Meyer-Zhao - Consultant Clustercomputing
Zheng Meyer-Zhao - zheng.meyer-zhao@surfsara.nl Consultant Clustercomputing Outline SURFsara About us What we do Cartesius and Lisa Architectures and Specifications File systems Funding Hands-on Logging
More information