Advanced cluster techniques with LoadLeveler

1 Advanced cluster techniques with LoadLeveler
How to get your jobs to the top of the queue
Ciaron Linstead, 10th May 2012

2 Outline
1 Introduction
2 LoadLeveler recap
3 CPUs
4 Memory
5 Factors affecting job priority
6 Multi-step jobs
7 Common errors
8 Useful bits and pieces
9 New tools on the cluster
10 Finally

3 Introduction
- Resources on the cluster
- Job priority
- Using multiple jobsteps
- Useful techniques and new tools

4 Outline (next: LoadLeveler recap)

5 LoadLeveler recap
LoadLeveler schedules workload by matching jobs to available resources. Typical workflow:
- Write a Job Command File (JCF): a shell script with LL-specific instructions (lines beginning with # @)
- Use llsubmit to start a run: llsubmit example.jcf
- Check progress with llq [-l]
- Check cluster load with llstatus [-l]
- Check class load with llclass [-l]
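A minimal session sketch using the commands above (file name from the slide; output not shown):

  $ llsubmit example.jcf   # submit; LoadLeveler prints the assigned job id
  $ llq -l                 # detailed listing of queued and running jobs
  $ llstatus               # per-node cluster load
  $ llclass -l             # per-class limits and availability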

6 The Job Command File
Simple serial (one-task) example:

 1  #!/bin/bash
 2  # @ job_name    = hello_world
 3  # @ class       = short
 4  # @ group       = its
 5  # @ notify_user = linstead
 6  # @ output      = /scratch/01/$(user)/$(job_name)_$(cluster)_$(stepid).out
 7  # @ error       = /scratch/01/$(user)/$(job_name)_$(cluster)_$(stepid).err
 8  # @ queue
 9
10  time /home/linstead/examples/c/hello

Lines 6 and 7 use variables to send output/error from different runs to different files.

7 Outline (next: CPUs)

8 CPU layout on the iDataPlex cluster
- 320 machines (nodes) with 8 CPUs each (2x 4-core Intel Xeon processors)
- 1 task gets 1 CPU

9 Mapping tasks to nodes
For parallel applications, task layout can affect performance:
- Minimise network connections by packing tasks densely onto nodes; tasks on the same node use (faster) shared memory to communicate
- Maximise memory or disk IO bandwidth per task by packing sparsely (and perhaps not sharing the node with other users' jobs)

10 Method 1: total_tasks and blocking
total_tasks = 24       ...I have this many tasks/processes
blocking = unlimited   ...and I don't care where they are located
or
blocking = 4           ...put at most 4 of my tasks on a node

11 Method 2: node and tasks_per_node
node = 3             ...I want this many nodes
tasks_per_node = 8   ...and this many tasks per node
Equivalent to total_tasks = 24 with blocking = 8.
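A hedged JCF sketch of both methods (job_type = parallel is the standard LoadLeveler keyword for multi-task jobs; the rest is from the slides):

  # Method 1: 24 tasks, at most 4 per node
  # @ job_type    = parallel
  # @ total_tasks = 24
  # @ blocking    = 4
  # @ queue

  # Method 2: exactly 3 nodes, 8 tasks on each (equivalent to blocking = 8)
  # @ job_type       = parallel
  # @ node           = 3
  # @ tasks_per_node = 8
  # @ queue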

12 Method 3: task_geometry
Useful to take advantage of the communication pattern of my program: tasks on the same node use shared memory instead of the network to communicate.
e.g. six tasks on four different nodes:
task_geometry = {(0,1) (3) (5,4) (2)}
(tasks 0 and 1 share one node, task 3 gets another, tasks 5 and 4 a third, task 2 a fourth)

13 Shared vs. not shared nodes
Nodes share memory, network and IO bandwidth. I can specify that I need all the resources:
node_usage = not_shared
or, using the LoadLeveler resources keyword (ConsumableCpus is the number of CPUs each task needs):
resources = ConsumableCpus(8)
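A hedged fragment putting these keywords in context (class and task count are illustrative, not from the slides):

  # @ class          = medium
  # @ node           = 1
  # @ tasks_per_node = 8
  # @ node_usage     = not_shared
  # @ queue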

14 But unshared nodes lead to:
- long queue times (LL needs to reserve entire nodes)
- low overall utilisation of the cluster (bad for our statistics!)

15 Outline (next: Memory)

16 Physical memory
- Total RAM per node: 32GB; minus the operating system, 28GB is usable
- Default: 3.5GB per core (28/8)
- largemem class: 14GB for each of 2 cores (6 cores idle)
- These limits are set in the system configuration with ulimit

17 Available memory
- Linux kills processes if a node runs out of memory (the OOM killer); sometimes that includes the LoadLeveler Starter daemon
- Very bad things happen on an interactive (login) node: filesystem daemons and LoadLeveler daemons can disappear
- Limit per-process memory with ulimit: malloc-like functions fail once the ulimit bound is reached, so check return values
- R loads workspaces from .RData on startup (use --no-restore-data or --no-restore)
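A shell sketch of the ulimit advice (the 4 GiB cap and the program name are illustrative; ulimit -v takes a value in KiB):

  $ ulimit -v 4194304   # cap this shell's virtual memory at 4 GiB
  $ ./mymodel           # allocations beyond the cap now fail (malloc returns NULL)
                        # instead of triggering the OOM killer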

18 LoadLeveler's consumable resources: memory
resources = ConsumableMemory(count)
- Doesn't enforce limits (unlike ulimit)
- Just use the default

19 Measure memory usage
- valgrind/massif, gprof, Intel VTune (C, Fortran)
- memory_profiler, Heapy, PySizer (Python)
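For example, a massif run might look like this (the binary name and the pid suffix 12345 are illustrative; ms_print ships with valgrind):

  $ valgrind --tool=massif ./mymodel   # writes massif.out.<pid>
  $ ms_print massif.out.12345          # text report of heap usage over time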

20 Outline (next: Factors affecting job priority)

21 How LoadLeveler calculates job priority
Jobs are dispatched based on priority, but can run out of order depending on resources.

SYSPRIO = (ClassSysprio * 100) + (UserSysprio * 10) + (GroupSysprio * 1)
          - (GroupRunningJobs) - (UserTotalJobs)

- Class priority goes from short (high) to long (low)
- All users (and groups) have equal priority
- You can prioritise your own jobs: user_priority = n (0-100, default 50)
- Re-prioritise queued jobs with llprio
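A hedged llprio example (the job step identifier is illustrative):

  $ llprio +20 cws02a.1234.0   # raise the user priority of a queued step by 20
  $ llprio -10 cws02a.1234.0   # lower it by 10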

22 How you can influence job dispatch time
- Use a different class: shorter-running classes have higher priority
- Also important: the wall clock limit. Defaults for short, medium and long are 1, 7 and 30 days; jobs are stopped (SIGTERM) at the limit
- A new (lower) limit can be set: wall_clock_limit = HH:MM:SS

23 Aside: how the backfill scheduler works
- Runs jobs out of order according to available resources
- Jobs have a known start time and a wall clock limit, so LoadLeveler knows the latest start time of the highest-priority queued job (it could start earlier, if running jobs finish before their wall clock limit is reached)
- LL won't start lower-priority jobs if they would delay the start of the highest-priority job (the "top dog")... even if there are unused resources

24 Aside: how the backfill scheduler works
[Figure: backfill scheduling timeline. Job 4 sets a lower wall_clock_limit than the default and can be backfilled.]

25 How you can influence job dispatch time
Use a lower wall clock limit: wall_clock_limit = HH:MM:SS
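For example, a 12-hour limit (well under the short class's 1-day default) gives the backfill scheduler room to start the job early:

  # @ wall_clock_limit = 12:00:00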

26 Outline (next: Multi-step jobs)

27 Using multiple jobsteps
Run a program multiple times with different input/output data from one JCF, or do data staging and post-processing, e.g.:
- 1st jobstep: use class io with 1 task to fetch archived input data
- 2nd jobstep: use class short with n tasks to do the model run
- 3rd jobstep: use class io with 1 task to archive the output data
(a combined sketch follows the dependency examples below)

28 Multi-step jobs: independent jobsteps
Run multiple independent steps with one JCF:

1  # @ executable = longjob
2  # @ input      = longjob.in.$(stepid)
3  # @ output     = longjob.out.$(jobid).$(stepid)
4  # @ error      = longjob.err.$(jobid).$(stepid)
5  # @ queue
6  # @ queue
7  # @ queue
8  # @ queue
9  # @ queue

(Use $(stepid) to differentiate input, output and error files.)

29 Multi-step jobs: dependent jobsteps
Run job steps with dependencies on previous steps:

 1  # @ step_name  = step1
 2  # @ executable = executable1
 3  # @ input      = step1.in1
 4  # @ output     = step1.out1
 5  # @ error      = step1.err1
 6  # @ queue
 7  # @ dependency = (step1 == 0)
 8  # @ step_name  = step2
 9  # @ input      = step2.in1
10  # @ output     = step2.out1
11  # @ error      = step2.err1
12  # @ queue

(Both steps use the same executable.)

30 Multi-step jobs: dependent jobsteps
Run job steps with dependencies on previous steps, now with different executables (lines 2 and 8):

 1  # @ step_name  = step1
 2  # @ executable = executable1
 3  # @ input      = step1.in1
 4  # @ output     = step1.out1
 5  # @ error      = step1.err1
 6  # @ queue
 7  # @ dependency = (step1 == 0)
 8  # @ executable = executable2
 9  # @ step_name  = step2
10  # @ input      = step2.in1
11  # @ output     = step2.out1
12  # @ error      = step2.err1
13  # @ queue

Status indicators in llq: C (Completed), NQ (Not Queued)
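A sketch combining the slide-27 staging pattern with these dependency keywords (class names are from the slides; the executables and task count are illustrative):

  # @ step_name  = fetch
  # @ class      = io
  # @ executable = /home/me/fetch_input.sh      # illustrative
  # @ queue
  # @ step_name   = model
  # @ dependency  = (fetch == 0)                # run only if fetch exited 0
  # @ class       = short
  # @ job_type    = parallel
  # @ total_tasks = 16                          # illustrative
  # @ executable  = /home/me/run_model.sh
  # @ queue
  # @ step_name  = archive
  # @ dependency = (model == 0)
  # @ class      = io
  # @ executable = /home/me/archive_output.sh
  # @ queue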

31 Outline (next: Common errors)

32 Common errors
Most errors cause submission to fail:
  llsubmit: Class "short" is not valid for group "itss".
  llsubmit: This job has not been submitted to LoadLeveler.

Job stays Idle: waiting for resources, e.g. ConsumableCpus=9
  cws02a  linstead  5/7 11:06  I   50  short

Job keeps switching between I (Idle) and ST (Starting): checking for input, e.g. executable = /file/doesn't/exist
  cws02a  linstead  5/7 11:06  I   50  short
  cws02a  linstead  5/7 11:06  ST  50  short

33 Outline (next: Useful bits and pieces)

34 Watch out for:
- Unwanted restarts: set restart = no to prevent LoadLeveler from restarting your job after a machine failure... unless your code can cope with a restart (e.g. by overwriting output files)
- What your application does with SIGTERM: LL uses SIGTERM to cancel jobs, and some models trap SIGTERM to clean up but then don't exit, which confuses LoadLeveler
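A minimal bash sketch of a SIGTERM handler that cleans up and still exits (the cleanup action and program name are illustrative):

  #!/bin/bash
  cleanup() {
      rm -f /scratch/01/$USER/model.lock   # illustrative cleanup
      exit 143                             # 128 + 15: exit after handling SIGTERM
  }
  trap cleanup TERM
  ./mymodel                                # long-running work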

35 Outline (next: New tools on the cluster)

36 New tools
- Python 2.7 and 3.2
- Python pip and virtualenv for installing your own packages (2.7 only)
- Distributed and Parallel Matlab (with 16 worker licences): submit Matlab tasks to compute nodes instead of running on login nodes; no need to keep the Matlab client open
- Intel VTune with sampling driver on login01 (profile code hotspots, cache misses, branch mispredicts)
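A hedged virtualenv workflow (the environment name and package are illustrative):

  $ virtualenv ~/env27             # create a private Python 2.7 environment
  $ source ~/env27/bin/activate    # use it in this shell
  $ pip install numpy              # installs into ~/env27, not system-wide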

37 Outline (next: Finally)

38 Summary
- Default resources
- Specify job requirements for better performance
- Use multiple jobsteps for more flexible runs
- New tools and useful options

39 New cluster 2014
Bid invitation preparation starts mid-2013:
- What do you like about the cluster?
- What do you dislike?
- What's on your wishlist?

40 Thank you!

41 References
- Cluster documentation (inc. these slides)
- TWS LoadLeveler - Using and Administering: pik-potsdam.de/members/linstead/documentation
- IBM Redbook - Workload Management with LoadLeveler
- Distributed Matlab at PIK
- Python memory profiler: /line-by-line-report-of-memory-usage/
- Valgrind
