Good to Great: Choosing NetworkComputer over Slurm


NetworkComputer White Paper

2560 Mission College Blvd., Suite 130
Santa Clara, CA 95054
(408) 492-0940

Introduction

Are you considering Slurm as your job scheduler, or are you currently a Slurm user wondering whether it is right for you because of issues you have been encountering? If you are an administrator or user who cares about efficiency and reliability when handling high-volume workloads, it may be worthwhile to consider NetworkComputer, a more powerful, efficient and reliable commercial job scheduler that is the industry standard for scalable high-performance computing. If you are a Slurm user, you may also want to know how Slurm compares to NetworkComputer before making a move. This white paper helps bridge that knowledge gap and shows that it is relatively easy to migrate from Slurm to NetworkComputer.

What is Slurm?

Slurm is a free workload manager that has been around since the early 2000s. It has known limitations in scaling and meeting job-capacity needs, along with an inability to fully utilize all available computing resources, so its lack of robustness limits its use in commercial markets. It also lacks monitoring capabilities, which is a major pitfall.

What is NetworkComputer?

NetworkComputer by Runtime Design Automation is a commercial, enterprise-grade job scheduler. It shares some basic capabilities with Slurm but offers much more practical value to the end user for everyday professional use. As a commercial scheduler used by top companies around the world, it is many times more scalable in capacity and performance, and it is much easier to use. As the industry's fastest job scheduler, NetworkComputer is built to be lightweight and easy to use, so it can also be deployed as a private scheduler for a single person, a group, or a project. If your productivity concerns include achieving the most efficient utilization of your expensive licenses and hardware resources, NetworkComputer will best fit your needs.

Comparing NetworkComputer vs. Slurm Terminology

In Slurm, the central component is called slurmctld. It manages the workload and all scheduling. Each computer (referred to as a "node" by Slurm) runs a daemon called slurmd, which does some analysis of the computer it is running on and then accepts jobs sent from slurmctld. The configuration for a Slurm cluster is typically kept in a single file, usually found at /etc/slurm-llnl/slurm.conf.
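For reference, here is what a minimal slurm.conf might contain for a small three-node cluster like the one used in the experiments later in this paper. This fragment is only an illustrative sketch: the node and partition names match the test setup described below, but the cluster name, CPU counts, memory sizes and partition limits are assumptions, not values taken from that installation.

# Illustrative fragment of /etc/slurm-llnl/slurm.conf (sizes and limits are assumed)
ClusterName=demo
ControlMachine=node3                # host running slurmctld
NodeName=node[1-3] CPUs=4 RealMemory=8000 State=UNKNOWN
PartitionName=debug Nodes=node[1-3] Default=YES MaxTime=INFINITE State=UP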

In NetworkComputer, the central component is called vovserver and the daemon running on each remote computer is called vovslave. The configuration is spread over several files, all contained in a directory called "vnc.swd" (pronounced "swid"), which is the "Server Working Directory". NetworkComputer has one file to describe the list of slaves and another to describe the list of resources, such as licenses and limits. Everything is "elastic" in the sense that slaves and resources can be added at any time, and the characteristics of a vovslave can be modified at will. This flexibility is important and a key reason why NetworkComputer is used for commercial purposes. By default, vovslaves automatically detect all characteristics of the machine they are running on, including RAM and CORES. Compare this to Slurm, which is not elastic in this regard.

NetworkComputer               Slurm                        Description
vovserver                     slurmctld                    The hub of the system, manager of workload and scheduling
vovslave                      slurmd                       Agent to execute jobs on a "node"
.../vnc.swd/slaves.tcl        /etc/slurm-llnl/slurm.conf   Configuration files
.../vnc.swd/resources.tcl

Comparing NetworkComputer vs. Slurm Commands

Slurm's command-line interface consists of a few commands such as sbatch, scancel, squeue, sinfo, scontrol and smap. NetworkComputer's command-line interaction is based on two commands: ncmgr, used by the manager to start and stop the system, and nc <command>, used by all users. Here is the usage message from nc, where boldface is used to highlight the commands that will be mentioned in this introduction:

% nc
nc: Usage Message
    Usage: nc [-q queuename] <command> [command options]

    Queue selection:
      The default queue is called "vnc". You can specify a different
      queue with the option -q <queuename> or by setting the
      environment variable NC_QUEUE.

    Commands:
      clean          Cleanup log files and env files.
      debug          Show how to run the same job without NetworkComputer.
      dispatch       Force dispatch of a job to a specific slave.
      forget         Forget old jobs from the system.
      getfield       Get a field for a job.
      gui            Start a simple graphical interface.
      help           This help message.
      hosts          Show farm hosts (also called slaves).
      info           Get information about a job and its outputs.
      list           List the jobs in the system.
      jobclass       List the available job classes.
      kerberos       Interface to Kerberos (experimental).
      modify         Modify attributes of scheduled jobs.
      monitor        Monitor network activity.
      rerun          Rerun a job already known to the system.
      resources      Shows resource list and current statistics.
      resume         Resume a job previously suspended.
      run <job>      Run a new job (also called 'submit').
      preempt        Preempt a job.
      slavelist      Show available slave lists.
      stop           Stop jobs.
      submit <job>   Same as 'run'.
      summary        Get a summary report for all my jobs.
      suspend        Suspend the execution of a job.
      wait           Wait for a job to complete.
      why            Analyze job status reasons.

    Unique abbreviations for commands are accepted.

    Advanced features:
      cmd <command>       Execute an arbitrary VOV command in the context of the NetworkComputer server.
      source <file.tcl>   Source the given Tcl file.
      -                   Accept commands from stdin.

    For more help type:
      % nc <command> -h

Copyright (c) , Runtime Design Automation.

In Slurm, you need to write a script to submit a command, whereas NetworkComputer allows for the submission of any type of command. For example, if one wants to submit to Slurm the command "sleep 0", a script like this must be used:

#!/bin/csh -f
# This is my script called ./sleep0.csh
sleep 0

NetworkComputer                     Slurm                             Description
nc run [OPTIONS] ./myscript.csh     sbatch [OPTIONS] ./myscript.csh   Methods to submit batch jobs
nc run [OPTIONS] sleep 0

nc stop ...                         scancel ...                       De-schedule submitted jobs; stop jobs if they are running

nc list                             squeue                            List jobs in the system

nc info JOBID                       scontrol show job JOBID           Detailed information about one job; wait for the specified job to be done
nc getfield JOBID
nc wait JOBID

nc gui &                            smap                              Graphical visualization of jobs

nc hosts                            sinfo                             Various commands to show information about the system
nc resources
nc cmd vsi

nc hosts                            sinfo -N                          List information about machines connected to the scheduler
nc monitor

Jobs Visualization and Interactive Queries in NetworkComputer

In Slurm, there is no comprehensive facility to visualize your job status or to drill down easily, point-and-click, for debugging. In NetworkComputer, you are provided with an interactive GUI where you can visualize the status of all scheduled jobs as well as drill down into any job to get real-time details. Plenty of other information, such as workload and resource details, is also available.

Figure 1: The NetworkComputer GUI shows jobs as colored boxes. The green jobs are done, the red jobs have failed, the orange jobs are currently running, and the cyan jobs are waiting for resources to become available.

Figure 2: In NetworkComputer, you can customize the view so that each box in the GUI shows the specific job details that matter to you. You can also easily drill down to get more job details.

Figure 3: NetworkComputer gives users views of workload and resources.

Comparing NetworkComputer Performance vs. Slurm for Light Workloads

In this example, the Slurm cluster consists of three identical desktops, called node1, node2 and node3, with the master running on node3. The NetworkComputer setup uses the same hardware, with the server running on node2. With a light load, the difference between Slurm and NetworkComputer is negligible.

In Slurm:

% sbatch ./sleep0.csh
Submitted batch job
% scontrol show job
   JobId= Name=sleep0.csh
   UserId=joe(1024) GroupId=joe(1002)
   Priority= Account=(null) QOS=(null)
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime= T11:11:42 EligibleTime= T11:11:42
   StartTime= T11:11:42 EndTime= T11:11:42
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=debug AllocNode:Sid=node2:22587
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node2 BatchHost=node2
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/joe/tmp/./sleep0.csh
   WorkDir=/home/joe/tmp

In NetworkComputer:

% nc run sleep 0
Fairshare       = /time/users
Resources       = linux64
Env             = SNAPSHOT(vnc_logs/snapshots/joe/linux64/env26253.env)
Command         = vw sleep 0
Logfile         = vnc_logs/ /
JobURL          =
JobId           =

% nc info
Id,User,Group      ,joe.joe,/time/users.joe
Environment        SNAPSHOT(vnc_logs/snapshots/joe/linux64/env26253.env)
Directory          /home/joe
Command            sleep 0
Resources          linux64
Submitted from     node2
Submitted at       Wed Jan 18 11:11:16 PST 2017
Priorities         schedule=normal execution=low
PlacementPolicy    fastest,pack
Status             Done
Host               localhost
Slave              localhost
QueueTime          0s

CPUTime            0.01
MaxRAM             0MB
Duration           0
Age                1m31s
AutoForget         1
Job is Done
Main Reason: This job successfully executed.

To simplify automation, NetworkComputer helps the developer in simple but effective ways, such as:

- The option -v 1 in nc run, which returns only the ID of the submitted job
- The command nc getfield, which allows direct access to one or more fields of a job without requiring any grep/awk work

NetworkComputer:

% set id = `nc run -v 1 sleep 0`
% nc wait $id
% nc getfield $id status
VALID

Slurm: no equivalent.

NetworkComputer Outperforms Slurm for Normal to Heavy Workloads

This is the major reason why Slurm is not fit for commercial needs: it cannot handle heavy loads. In fact, it struggles with less-than-heavy loads, as shown in the next example. In NetworkComputer, a constant load of 100,000 or more jobs in the queue is considered an ordinary load, while such a load chokes Slurm. A million jobs in the queue is a heavy load, easily handled by NetworkComputer.

Let us assume we have a workload consisting of 80,000 jobs. In Slurm, you may want to submit the jobs with an array. The maximum array size in our default installation seems to be 1,000 elements, so we will need to submit 80 arrays. Our Slurm installation stops accepting jobs after less than 10,000 jobs are in the queue, which is a serious limitation, while NetworkComputer easily accepts the whole workload in about 6 seconds.

NetworkComputer:

% time repeat 80 nc run -v 0 -array 1000 sleep 0
... omitting some output from 'time' ...
     u 0.008s 0:     %   0+0k 0+8io 0pf+0w
0.052u 0.004s 0:     %   0+0k 0+8io 0pf+0w
0.043u 0.008s 0:     %   0+0k 0+8io 0pf+0w
0.051u 0.004s 0:     %   0+0k 0+8io 0pf+0w
0.052u 0.000s 0:     %   0+0k 0+8io 0pf+0w
3.974u 0.462s 0:     %   0+0k 0+640io 0pf+0w

% nc summary
NC Summary For Set System:User:joe
TOTAL JOBS     80,001    Duration: 3m15s
      Done        690
    Queued     79,309
   Running          2

Slurm:

% repeat 80 sbatch --array= ./sleep0.csh
Submitted batch job
Submitted batch job
Submitted batch job
Submitted batch job
Submitted batch job
Submitted batch job
Submitted batch job
Submitted batch job
Submitted batch job
sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.

In a similar experiment, scaling much larger, NetworkComputer easily handles 880,000 jobs. To check the status of the workload, we can use the "summary" report option, which is efficient, compact and easy to understand (Slurm has no equivalent function):

% nc summary -a -b
NC Summary For Set System:jobs
TOTAL JOBS    101,821    Duration: 37m40s
      Done     26,677
    Queued     75,138
   Running          4

BKT  JOBS   PRI  AGE   GROUP            USER  TOOL      WAITING FOR
        ,         s    /time/users.joe  joe   hostname  HW linux64
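The repeat command used above is a csh/tcsh built-in. If you drive the submission from a script instead, a simple loop achieves the same thing. Below is a minimal csh sketch; the Slurm array range shown in the comment is an assumption (1-1000), since the exact range does not appear in the output above.

#!/bin/csh -f
# Submit the 80,000-job workload as 80 batches of 1,000 jobs each,
# equivalent to the 'repeat 80 ...' commands shown above.
set i = 1
while ($i <= 80)
    nc run -v 0 -array 1000 sleep 0
    # Slurm equivalent (array range assumed):
    # sbatch --array=1-1000 ./sleep0.csh
    @ i++
end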

Comparing NetworkComputer vs. Slurm Scheduler Status

In Slurm, to get the scheduler status you can execute scontrol ping and get a very simple report:

% scontrol ping
Slurmctld(primary/backup) at node3/(null) are UP/DOWN

In NetworkComputer, a common method to check status is "nc cmd vsi" (vsi stands for vov-server-info), which gives more meaningful, relevant information:

% nc cmd vsi
Vov Server Information - 01/10/   :44:55
vnchq@node3:6271
URL:
Jobs:     101,892      Workload:
Files:    101,904        - running:      5
Sets:          22        - queued:  59,242
Retraces:       0        - done:    42,572
                         - failed:
Slaves:         2      Buckets:            1
  - busy:       1      Duration:          0s
  - full:       1      SchedulerTime:  0.00s
Slots:                 TotalResources:    14
Pid:          825      Saved:      1h29m ago
Size:          MB      TimeTolerance:     3s

Recent jobs for user joe:
Done     vw hostname > vnc_logs/ /
Done     vw hostname > vnc_logs/ /
Done     vw hostname > vnc_logs/ /
Running  vw hostname > vnc_logs/ /
Running  vw hostname > vnc_logs/ /
Running  vw hostname > vnc_logs/ /

Comparing NetworkComputer vs. Slurm Suspension Capabilities

In Slurm, you can suspend and resume a job, but only if you are root or the admin user. This is a serious limitation. For example, if we try to suspend our job, we get:

% scontrol suspend
slurm_suspend error: Access/permission denied
% scontrol resume

slurm_suspend error: Access/permission denied

In NetworkComputer, the owner of a job can suspend it and resume it. This is basic functionality that is valuable in any practical usage. The capability can also be given to any user who has ADMIN privileges.

% nc suspend
vnc 02/20/   :51:59: message: Suspending job
% nc suspend
vnc 02/20/   :52:02: message: No need to suspend : it is suspended.
% nc resume
vnc 02/20/   :52:51: message: Resuming job
% nc resume
vnc 02/20/   :52:56: message: No need to resume : it is running.

Another ability is to preempt a job, with nc preempt:

% nc preempt

In this case the job is suspended, all resources associated with the job are freed (including licenses and CPUs), and those resources are made available to other "more important" jobs in the queue. If no such job exists, the preempted job is automatically resumed. Slurm has a similar but less featureful preemption capability.

Comparing How NetworkComputer vs. Slurm Handles Dependencies

In Slurm, to execute a job after another one (e.g. the job with a given id) has completed, we can say:

% sbatch --dependency=afterok: ./mysleep.csh

In NetworkComputer we have a dependency similar to "afterok":

% set j1 = `nc run -v 1 sleep 10`
% nc run -dep $j1 sleep 2

In addition, NetworkComputer has the key advantage of offering a simple way to wait for a job to complete, with nc wait, which does not exist in Slurm:

% set j1 = `nc run -v 1 sleep 10`
% nc wait $j1

If we want to run one job at a time, in Slurm we can use the "singleton" dependency, while in NetworkComputer we can use the "-limit 1" option of "nc run":

NetworkComputer:

% nc run -limit 1 -array 1000 sleep 0

Slurm:

% sbatch -J myname --dependency=singleton --array= ./mysleep0.csh

Comparing How NetworkComputer vs. Slurm Manages Software Licenses

In Slurm, licenses can be represented by "Licenses" lines in the slurm.conf file:

# Fragment of slurm.conf
Licenses=verilog:3,spice:2

In NetworkComputer, licenses are sampled automatically, typically every 30 seconds, by the LicenseMonitor subsystem, which then immediately updates the scheduler. This allows the automatic tracking and management of all features served by FLEXlm or any other license daemon. NetworkComputer typically handles many hundreds of these licenses. For commercial purposes, this is a much more robust system.

Comparing NetworkComputer vs. Slurm Architecture

In Slurm, the list of current jobs (less than 40k jobs) is held in the directory /var/lib/slurm-llnl/slurmctld on the master node. Each job is a sub-directory which contains:

- a copy of the submission script
- a snapshot of the submission environment

In NetworkComputer, all job information is kept efficiently in memory. Here is a snapshot of the two daemons running on the same machine, each after running about 400,000 jobs:

NetworkComputer:

ncadmin    ?  S   Feb17   1:56  vovserver -p nc

Slurm:

slurm      ?  Sl  Feb17  68:05  /usr/sbin/slurmctld

Note that the NetworkComputer vovserver memory footprint is less than half the size of the slurmctld memory footprint, even though it holds all 400k jobs in memory. NetworkComputer's memory management is thus observed to be far superior to Slurm's.

So, you want to use NetworkComputer with Slurm?

Yes, you can get the capacity and ease-of-use benefits of NetworkComputer while using Slurm as the main allocator of computing resources. In situations where you need to retain Slurm for whatever reason, NetworkComputer can easily piggyback on Slurm. This is like having your own private scheduler for your workload without violating the rules of your organization. A sample method to test-drive NetworkComputer using computing resources from your existing Slurm installation is shown here:

1. Install NetworkComputer on a shared file system (example: in /remote/sw/runtime/ ).

2. Set up your shell by sourcing one of the setup scripts found in the installation directory (example: /remote/sw/runtime/ /common/etc/vovrc.{sh,csh}).

3. Start your private scheduler:

   % ncmgr start -dir . -queue my_vnc
   ...
   % setenv NC_QUEUE my_vnc

4. Create the following script, which starts a transient vovslave on the current host:

   % cat ncslave.csh
   #!/bin/csh -f
   # Start a slave with 1 slot, max load 100, for no more than 2 hours
   vovslaveroot -T 1 -M 100 -a "@HOST@_@PID@" -z 1m -Z 2h

5. Request computing resources from Slurm:

   % vovproject enable my_vnc
   % sbatch ./ncslave.csh
   % sbatch --array=1-50 ./ncslave.csh

Now you can submit jobs to your NetworkComputer instance and use resources from Slurm. If you are the network administrator, you can one day consider moving the entire management of your clusters to NetworkComputer.
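Putting these steps together, a small wrapper might look like the sketch below. It reuses only the commands shown above; the installation path is abbreviated (your installation will have its own versioned directory), the queue name and the 1-50 array are the same illustrative values used in the steps, and "sleep 0" is just a placeholder job.

#!/bin/csh -f
# Sketch: run a private NetworkComputer queue on top of Slurm,
# reusing the commands from the steps above.

# Set up the shell from the shared installation
# (path abbreviated; source the vovrc.csh of your own installation)
source /remote/sw/runtime/common/etc/vovrc.csh

# Start the private scheduler and make it the default queue for 'nc'
ncmgr start -dir . -queue my_vnc
setenv NC_QUEUE my_vnc
vovproject enable my_vnc

# Ask Slurm for computing resources by launching 50 transient vovslaves
sbatch --array=1-50 ./ncslave.csh

# Submit work to the private NetworkComputer queue
nc run sleep 0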

Summary

Although Slurm is free, it has major limitations in scalability and usability that prevent it from being a dependable solution for commercial applications. In fact, this is why you won't find it used in commercial settings with serious reliability needs. Able to deal only with lighter workloads, it lacks the capacity needed for everyday use. In addition, its user interface is raw and lacks user-friendly functions to give users proper visibility into their jobs.

With NetworkComputer, you get a robust, enterprise-grade job scheduler that scales across all workload types, delivering high performance and capacity, as well as GUIs that give users maximum visibility into their jobs. You also receive enterprise-level service, so you know you have full customer support for your mission-critical needs. Today, NetworkComputer is the job scheduler of choice for major Fortune companies for these reasons.

To get started with NetworkComputer, visit and sign up.
