Advanced cluster techniques with LoadLeveler

1 Advanced cluster techniques with LoadLeveler
How to get your jobs to the top of the queue
Ciaron Linstead, 10th May 2012

2 Outline
1 Introduction
2 LoadLeveler recap
3 CPUs
4 Memory
5 Factors affecting job priority
6 Multi-step jobs
7 Common errors
8 Useful bits and pieces
9 New tools on the cluster
10 Finally

3 Introduction
- Resources on the cluster
- Job priority
- Using multiple jobsteps
- Useful techniques and new tools

4 Outline (next: LoadLeveler recap)

5 LoadLeveler recap
LoadLeveler schedules workload by matching jobs to available resources. Typical workflow:
- Write a Job Command File (JCF): a shell script with LL-specific instructions (lines beginning with # @)
- Use llsubmit to start a run: llsubmit example.jcf
- Check progress with llq [-l]
- Check cluster load with llstatus [-l]
- Check class load with llclass [-l]
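A minimal session sketch using the commands above (file name from the slide; output not shown):

  $ llsubmit example.jcf   # submit; LoadLeveler prints the assigned job id
  $ llq -l                 # detailed listing of queued and running jobs
  $ llstatus               # per-node cluster load
  $ llclass -l             # per-class limits and availability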

6 The Job Command File
Simple serial (one-task) example:

 1  #!/bin/bash
 2  # @ job_name    = hello_world
 3  # @ class       = short
 4  # @ group       = its
 5  # @ notify_user = linstead
 6  # @ output      = /scratch/01/$(user)/$(job_name)_$(cluster)_$(stepid).out
 7  # @ error       = /scratch/01/$(user)/$(job_name)_$(cluster)_$(stepid).err
 8  # @ queue
 9
10  time /home/linstead/examples/c/hello

Lines 6 and 7 use variables to send output/error from different runs to different files.

7 Outline (next: CPUs)

8 CPU layout on the iDataPlex cluster
- 320 machines (nodes) with 8 CPUs each (2x 4-core Intel Xeon processors)
- 1 task gets 1 CPU

9 Mapping tasks to nodes
For parallel applications, task layout can affect performance:
- Minimise network connections by packing tasks densely onto nodes; tasks on the same node use (faster) shared memory to communicate
- Maximise memory or disk IO bandwidth per task by packing sparsely (and perhaps not sharing the node with other users' jobs)

10 Method 1: total_tasks and blocking
total_tasks = 24       ...I have this many tasks/processes
blocking = unlimited   ...and I don't care where they are located
or
blocking = 4           ...put at most 4 of my tasks on a node

11 Method 2: node and tasks_per_node
node = 3             ...I want this many nodes
tasks_per_node = 8   ...and this many tasks per node
Equivalent to total_tasks = 24 with blocking = 8.
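A hedged JCF sketch of both methods (job_type = parallel is the standard LoadLeveler keyword for multi-task jobs; the rest is from the slides):

  # Method 1: 24 tasks, at most 4 per node
  # @ job_type    = parallel
  # @ total_tasks = 24
  # @ blocking    = 4
  # @ queue

  # Method 2: exactly 3 nodes, 8 tasks on each (equivalent to blocking = 8)
  # @ job_type       = parallel
  # @ node           = 3
  # @ tasks_per_node = 8
  # @ queue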

12 Method 3: task_geometry
Useful to take advantage of the communication pattern of my program: tasks on the same node use shared memory instead of the network to communicate.
e.g. six tasks on four different nodes:
task_geometry = {(0,1) (3) (5,4) (2)}
(tasks 0 and 1 share one node, task 3 gets another, tasks 5 and 4 a third, task 2 a fourth)

13 Shared vs. not shared nodes
Nodes share memory, network and IO bandwidth. I can specify that I need all the resources:
node_usage = not_shared
or, using the LoadLeveler resources keyword (ConsumableCpus is the number of CPUs each task needs):
resources = ConsumableCpus(8)
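A hedged fragment putting these keywords in context (class and task count are illustrative, not from the slides):

  # @ class          = medium
  # @ node           = 1
  # @ tasks_per_node = 8
  # @ node_usage     = not_shared
  # @ queue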

14 But unshared nodes lead to:
- long queue times (LL needs to reserve entire nodes)
- low overall utilisation of the cluster (bad for our statistics!)

15 Outline (next: Memory)

16 Physical memory
- Total RAM per node: 32GB; minus the operating system, 28GB is usable
- Default: 3.5GB per core (28/8)
- largemem class: 14GB for each of 2 cores (6 cores idle)
- These limits are set in the system configuration with ulimit

17 Available memory
- Linux kills processes if a node runs out of memory (the OOM killer); sometimes that includes the LoadLeveler Starter daemon
- Very bad things happen on an interactive (login) node: filesystem daemons and LoadLeveler daemons can disappear
- Limit per-process memory with ulimit: malloc-like functions fail once the ulimit bound is reached, so check return values
- R loads workspaces from .RData on startup (use --no-restore-data or --no-restore)
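A shell sketch of the ulimit advice (the 4 GiB cap and the program name are illustrative; ulimit -v takes a value in KiB):

  $ ulimit -v 4194304   # cap this shell's virtual memory at 4 GiB
  $ ./mymodel           # allocations beyond the cap now fail (malloc returns NULL)
                        # instead of triggering the OOM killer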

18 LoadLeveler's consumable resources: memory
resources = ConsumableMemory(count)
- Doesn't enforce limits (unlike ulimit)
- Just use the default

19 Measure memory usage
- valgrind/massif, gprof, Intel VTune (C, Fortran)
- memory_profiler, Heapy, PySizer (Python)
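For example, a massif run might look like this (the binary name and the pid suffix 12345 are illustrative; ms_print ships with valgrind):

  $ valgrind --tool=massif ./mymodel   # writes massif.out.<pid>
  $ ms_print massif.out.12345          # text report of heap usage over time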

20 Outline (next: Factors affecting job priority)

21 How LoadLeveler calculates job priority
Jobs are dispatched based on priority, but can run out of order depending on resources.

SYSPRIO = (ClassSysprio * 100) + (UserSysprio * 10) + (GroupSysprio * 1)
          - (GroupRunningJobs) - (UserTotalJobs)

- Class priority goes from short (high) to long (low)
- All users (and groups) have equal priority
- You can prioritise your own jobs: user_priority = n (0-100, default 50)
- Re-prioritise queued jobs with llprio
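A hedged llprio example (the job step identifier is illustrative):

  $ llprio +20 cws02a.1234.0   # raise the user priority of a queued step by 20
  $ llprio -10 cws02a.1234.0   # lower it by 10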

22 How you can influence job dispatch time
- Use a different class: shorter-running classes have higher priority
- Also important: the wall clock limit. Defaults for short, medium and long are 1, 7 and 30 days; jobs are stopped (SIGTERM) at the limit
- A new (lower) limit can be set: wall_clock_limit = HH:MM:SS

23 Aside: how the backfill scheduler works
- Runs jobs out of order according to available resources
- Jobs have a known start time and a wall clock limit, so LoadLeveler knows the latest start time of the highest-priority queued job (it could start earlier, if running jobs finish before their wall clock limit is reached)
- LL won't start lower-priority jobs if they would delay the start of the highest-priority job (the "top dog")... even if there are unused resources

24 Aside: how the backfill scheduler works
[Figure: backfill scheduling timeline. Job 4 sets a lower wall_clock_limit than the default and can be backfilled.]

25 How you can influence job dispatch time
Use a lower wall clock limit: wall_clock_limit = HH:MM:SS
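For example, a 12-hour limit (well under the short class's 1-day default) gives the backfill scheduler room to start the job early:

  # @ wall_clock_limit = 12:00:00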

26 Outline (next: Multi-step jobs)

27 Using multiple jobsteps
Run a program multiple times with different input/output data from one JCF, or do data staging and post-processing, e.g.:
- 1st jobstep: use class io with 1 task to fetch archived input data
- 2nd jobstep: use class short with n tasks to do the model run
- 3rd jobstep: use class io with 1 task to archive the output data
(a combined sketch follows the dependency examples below)

28 Multi-step jobs: independent jobsteps
Run multiple independent steps with one JCF:

1  # @ executable = longjob
2  # @ input      = longjob.in.$(stepid)
3  # @ output     = longjob.out.$(jobid).$(stepid)
4  # @ error      = longjob.err.$(jobid).$(stepid)
5  # @ queue
6  # @ queue
7  # @ queue
8  # @ queue
9  # @ queue

(Use $(stepid) to differentiate input, output and error files.)

29 Multi-step jobs: dependent jobsteps
Run job steps with dependencies on previous steps:

 1  # @ step_name  = step1
 2  # @ executable = executable1
 3  # @ input      = step1.in1
 4  # @ output     = step1.out1
 5  # @ error      = step1.err1
 6  # @ queue
 7  # @ dependency = (step1 == 0)
 8  # @ step_name  = step2
 9  # @ input      = step2.in1
10  # @ output     = step2.out1
11  # @ error      = step2.err1
12  # @ queue

(Both steps use the same executable.)

30 Multi-step jobs: dependent jobsteps
Run job steps with dependencies on previous steps, now with different executables (lines 2 and 8):

 1  # @ step_name  = step1
 2  # @ executable = executable1
 3  # @ input      = step1.in1
 4  # @ output     = step1.out1
 5  # @ error      = step1.err1
 6  # @ queue
 7  # @ dependency = (step1 == 0)
 8  # @ executable = executable2
 9  # @ step_name  = step2
10  # @ input      = step2.in1
11  # @ output     = step2.out1
12  # @ error      = step2.err1
13  # @ queue

Status indicators in llq: C (Completed), NQ (Not Queued)
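A sketch combining the slide-27 staging pattern with these dependency keywords (class names are from the slides; the executables and task count are illustrative):

  # @ step_name  = fetch
  # @ class      = io
  # @ executable = /home/me/fetch_input.sh      # illustrative
  # @ queue
  # @ step_name   = model
  # @ dependency  = (fetch == 0)                # run only if fetch exited 0
  # @ class       = short
  # @ job_type    = parallel
  # @ total_tasks = 16                          # illustrative
  # @ executable  = /home/me/run_model.sh
  # @ queue
  # @ step_name  = archive
  # @ dependency = (model == 0)
  # @ class      = io
  # @ executable = /home/me/archive_output.sh
  # @ queue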

31 Outline (next: Common errors)

32 Common errors
Most errors cause submission to fail:
  llsubmit: Class "short" is not valid for group "itss".
  llsubmit: This job has not been submitted to LoadLeveler.

Job stays Idle: waiting for resources, e.g. ConsumableCpus=9
  cws02a  linstead  5/7 11:06  I   50  short

Job keeps switching between I (Idle) and ST (Starting): checking for input, e.g. executable = /file/doesn't/exist
  cws02a  linstead  5/7 11:06  I   50  short
  cws02a  linstead  5/7 11:06  ST  50  short

33 Outline (next: Useful bits and pieces)

34 Watch out for:
- Unwanted restarts: set restart = no to prevent LoadLeveler from restarting your job after a machine failure... unless your code can cope with a restart (e.g. by overwriting output files)
- What your application does with SIGTERM: LL uses SIGTERM to cancel jobs, and some models trap SIGTERM to clean up but then don't exit, which confuses LoadLeveler
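A minimal bash sketch of a SIGTERM handler that cleans up and still exits (the cleanup action and program name are illustrative):

  #!/bin/bash
  cleanup() {
      rm -f /scratch/01/$USER/model.lock   # illustrative cleanup
      exit 143                             # 128 + 15: exit after handling SIGTERM
  }
  trap cleanup TERM
  ./mymodel                                # long-running work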

35 Outline (next: New tools on the cluster)

36 New tools
- Python 2.7 and 3.2
- Python pip and virtualenv for installing your own packages (2.7 only)
- Distributed and Parallel Matlab (with 16 worker licences): submit Matlab tasks to compute nodes instead of running on login nodes; no need to keep the Matlab client open
- Intel VTune with sampling driver on login01 (profile code hotspots, cache misses, branch mispredicts)
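A hedged virtualenv workflow (the environment name and package are illustrative):

  $ virtualenv ~/env27             # create a private Python 2.7 environment
  $ source ~/env27/bin/activate    # use it in this shell
  $ pip install numpy              # installs into ~/env27, not system-wide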

37 Outline (next: Finally)

38 Summary
- Default resources
- Specify job requirements for better performance
- Use multiple jobsteps for more flexible runs
- New tools and useful options

39 New cluster 2014
Bid invitation preparation starts mid-2013:
- What do you like about the cluster?
- What do you dislike?
- What's on your wishlist?

40 Thank you!

41 References
- Cluster documentation (inc. these slides)
- TWS LoadLeveler - Using and Administering: pik-potsdam.de/members/linstead/documentation
- IBM Redbook - Workload Management with LoadLeveler
- Distributed Matlab at PIK
- Python memory profiler: /line-by-line-report-of-memory-usage/
- Valgrind
