Choosing Resources Wisely
Plamen Krastev
Office: 38 Oxford, Room 117
Email: plamenkrastev@fas.harvard.edu
Objectives
- Inform you of available computational resources
- Help you choose appropriate computational resources for your research
- Provide guidance for scaling up your applications and performing computations more efficiently
More efficient use = more resources available to do research
Enable you to work smarter, better, faster
Slide 2
Outline
- Choosing computational resources
- Overview of available RC resources: Partition / Queue, Time, Number of nodes and cores, Memory, Storage
- Examples
Slide 3
What resources do I need?
- Is my code serial or parallel?
- How many cores and/or nodes does it need?
- How much memory does it require?
- How long does my code take to run?
- How big is the input / output data for each run?
- How is the input data read by the code (e.g., hardcoded, keyboard, parameter/data file(s), external database/website, etc.)?
Slide 4
What resources do I need?
- How is the output data written by the code (standard output/screen, data file(s), etc.)?
- How many tasks/jobs/runs do I need to complete?
- What is my timeframe / deadline for the project (e.g., paper, conference, thesis, etc.)?
- What computational resources are available at Research Computing?
Slide 5
RC resources: Odyssey
Odyssey is a large-scale heterogeneous HPC cluster.
Compute:
- 60,000+ compute cores (and increasing)
- Cores per node: 8 to 64
- Memory per node: 12GB to 512GB (4GB/core)
- 1,000,000+ NVIDIA GPU cores
Storage:
- Over 35PB of storage
- Home directories: 100GB
- Lab space: initial 4TB at $0, with expansion on a TB basis available for purchase at $45/TB/year
- Local scratch: 270GB/node
- Global scratch: high-performance shared scratch, 1PB total, Lustre file system
https://rc.fas.harvard.edu/resources/odyssey-storage
Slide 6
RC resources: Odyssey
Odyssey is a large-scale heterogeneous HPC cluster.
Software:
- CentOS
- SLURM job manager
- 1,000+ scientific tools and programs: https://portal.rc.fas.harvard.edu/apps/modules
Interconnect:
- 2 underlying networks connecting 3 data centers
- TCP/IP network
- Low-latency 56 Gb/s InfiniBand network: inter-node parallel computing, fast access to Lustre-mounted storage
Hosted machines:
- 300+ virtual machines
- Lab instrument workstations
Slide 7
Available Storage

|                  | Home Directories | Lab Storage | Local Scratch | Global Scratch | Persistent Research Data |
|------------------|------------------|-------------|---------------|----------------|--------------------------|
| Size Limit       | 100GB | 4TB+ | 270GB/node | 1.2PB total | 3PB |
| Availability     | All cluster nodes + desktop/laptop | All cluster nodes + desktop/laptop | Local compute node only | All cluster nodes | Only IB-connected cluster nodes |
| Backup           | Hourly snapshot + daily offsite | Daily offsite | No backup | No backup | External repos; no backup |
| Retention Policy | Indefinite | Indefinite | Job duration | 90 days | 3-9 months |
| Performance      | Moderate; not suitable for high I/O | Moderate; not suitable for high I/O | Suited for small-file I/O-intensive jobs | Appropriate for large-file I/O-intensive jobs | Appropriate for large I/O-intensive jobs |
| Cost             | Free | 4TB free + expansion at $45/TB/yr | Free | Free | Free |

Slide 8
Partition / Queue

| Partition      | Time Limit | # Nodes | # Cores / Node | Memory / Node (GB) |
|----------------|------------|---------|----------------|--------------------|
| general        | 7 days     | 177     | 64             | 256                |
| serial_requeue | 7 days     | 1071    | 8-64           | 12-512             |
| interact       | 3 days     | 8       | 64             | 256                |
| bigmem         | no limit   | 7       | 64             | 512                |
| unrestricted   | no limit   | 8       | 64             | 256                |
| Lab queues     | no limit   | 1154    | 8-64           | 12-512             |

https://rc.fas.harvard.edu/resources/running-jobs/#slurm_partitions

Batch jobs:
#SBATCH -p general   # Partition name

Interactive or test jobs:
srun -p interact OTHER_OPTIONS

Slide 9
Time
How long does my code take to run?

Batch jobs:
#SBATCH -p serial_requeue
#SBATCH -t 0-02:00   # Time in D-HH:MM

Interactive or test jobs:
srun -t 0-02:00 -p interact OTHER_JOB_OPTIONS

Slide 10
Number of nodes and cores
Is my code serial or parallel?

Serial (single-core) jobs

Batch jobs:
#SBATCH -p serial_requeue
#SBATCH -c 1   # Number of cores

Interactive or test jobs:
srun -c 1 -p interact OTHER_JOB_OPTIONS

Core / Thread / Process / CPU
Slide 11
Number of nodes and cores
Parallel shared memory (single node) jobs
Examples: OpenMP (Fortran, C/C++), MATLAB Parallel Computing Toolbox (PCT), Python (e.g., threading, multiprocessing), R (e.g., multicore)

Batch jobs:
#SBATCH -p general   # Partition
#SBATCH -N 1         # Number of nodes
#SBATCH -c 4         # Number of cores (per task)
srun -c 4 PROGRAM PROGRAM_OPTIONS

Interactive or test jobs:
srun -p interact -N 1 -c 4 OTHER_OPTIONS

Slide 12
Number of nodes and cores
Parallel distributed memory (multi-node) jobs
Examples: MPI (openmpi, impi, mvapich) with Fortran or C/C++ code, MATLAB Distributed Computing Server (DCS), Python (e.g., mpi4py), R (e.g., Rmpi, snow)

Batch jobs:
#SBATCH -p general   # Partition
#SBATCH -n 4         # Number of tasks
srun -n 4 PROGRAM PROGRAM_OPTIONS

Interactive or test jobs:
srun -p interact -n 4 OTHER_OPTIONS

Slide 13
Memory
Serial and parallel shared memory (single node) jobs

Batch jobs:
#SBATCH -p serial_requeue   # Partition
#SBATCH --mem=4000          # Memory / node in MB

Interactive or test jobs:
srun --mem=4000 -p interact OTHER_OPTIONS

Parallel distributed memory (multi-node) jobs

Batch jobs:
#SBATCH -p general           # Partition
#SBATCH -n 4                 # Number of tasks
#SBATCH --mem-per-cpu=4000   # Memory / core in MB

Interactive or test jobs:
srun --mem-per-cpu=4000 -n 4 -p interact OTHER_OPTIONS

Slide 14
Memory
How much memory does my code require?
- Understand your code and how its algorithms scale, analytically
- Run an interactive job and monitor memory usage (with the top Unix command)
- Run a test batch job and check memory usage after the job has completed (with the sacct SLURM command)
Slide 15
Memory
Know your code
Example: a real*8 (Fortran) or double (C/C++) matrix of dimension 100,000 x 100,000 requires ~80GB of RAM.

| Data Type (Fortran / C)     | Bytes |
|-----------------------------|-------|
| integer*4 / int             | 4     |
| integer*8 / long            | 8     |
| real*4 / float              | 4     |
| real*8 / double             | 8     |
| complex*8 / float complex   | 8     |
| complex*16 / double complex | 16    |

Slide 16
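As a quick sanity check, the ~80GB figure above can be reproduced with shell arithmetic (dimension and element size taken from the example and table on this slide):

```shell
# Back-of-the-envelope RAM estimate for a dense N x N matrix
# of 8-byte (real*8 / double) elements: N * N * 8 bytes.
N=100000
BYTES=$((N * N * 8))
echo "$((BYTES / 1000000000)) GB"   # prints "80 GB"
```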
Memory
Run an interactive job and monitor memory usage (with the top Unix command).
Example: check the memory usage of a matrix diagonalization code.
- Request an interactive bash shell session: srun -p interact -n 1 -t 0-02:00 --pty --mem=4000 bash
- Run the code, e.g., ./matrix_diag.x
- Open a new shell terminal and ssh to the compute node where the interactive job was dispatched, e.g., ssh holy2a18307
- In the new shell terminal, run top, e.g., top -u pkrastev
Slide 17
Memory
Run 1: matrix dimension = 3,000 x 3,000 (real*8)
Needs 3,000 x 3,000 x 8 bytes / 1,000,000 = ~72MB of RAM
Slide 18
Memory
Run 2: input size changed
Doubling the matrix dimension quadruples the required memory.
Matrix dimension = 6,000 x 6,000 (real*8)
Needs 6,000 x 6,000 x 8 bytes / 1,000,000 = ~288MB of RAM
Slide 19
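The quadratic scaling between Run 1 and Run 2 can be sketched with the same formula (decimal MB, as on these slides):

```shell
# Memory grows with the square of the matrix dimension:
# doubling N quadruples the required RAM (72MB -> 288MB -> 1152MB).
for N in 3000 6000 12000; do
    echo "N=$N -> $((N * N * 8 / 1000000)) MB"
done
```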
sacct overview
sacct queries the SLURM accounting database. Every 30 seconds, the node records the CPU and memory usage of all process IDs belonging to a given job. After the job ends, this data is sent to the SLURM database.
Common flags:
-j jobid or --name=jobname
-S YYYY-MM-DD and -E YYYY-MM-DD
-o output_options, e.g., JobID,JobName,NCPUS,Nnodes,Submit,Start,End,CPUTime,TotalCPU,ReqMem,MaxRSS,MaxVMSize,State,Exit,Node
http://slurm.schedmd.com/sacct.html
Slide 20
Memory
Run a test batch job and check memory usage after the job has completed (with the sacct SLURM command).
Example:
[pkrastev@sa01 Resources]$ sacct -o ReqMem,MaxRSS -j 70446364
    ReqMem     MaxRSS
---------- ----------
     320Mn    286648K
MaxRSS = 286648KB = ~286.6MB actually used. ReqMem = 320Mn means 320MB was requested per node, i.e., roughly 10% above MaxRSS, a sensible safety margin.
https://rc.fas.harvard.edu/resources/faq/how-to-know-what-memory-limit-to-put-on-my-job
Slide 21
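A small helper like the following (hypothetical, not an RC-provided tool) turns the MaxRSS value that sacct reports in KB into a suggested --mem request in MB with the ~10% margin described above:

```shell
# Convert MaxRSS (KB, as reported by sacct) into a suggested --mem value (MB)
# with a ~10% safety margin, rounded up. 286648 is the example from this slide.
maxrss_kb=286648
mem_mb=$(( maxrss_kb * 110 / 100 / 1000 + 1 ))   # +10%, KB -> MB, round up
echo "--mem=${mem_mb}"                           # prints "--mem=316"
```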
Storage
- Home directories (/n/home*) and Lab storage are not appropriate for I/O-intensive workloads or large numbers of jobs. Typical uses are job scripts, in-house analysis codes, and self-installed software.
- For jobs that create a high volume of small files (< 10 MB), use local scratch. You need to copy your input data to /scratch and move your output data to a different location after the job completes.
- For I/O-intensive jobs with large data files (> 100 MB) and/or a large number of data files (100s of 10-100MB files), use the global scratch file system, /n/regal.
https://rc.fas.harvard.edu/policy-scratch
Slide 22
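The local-scratch workflow can be sketched as a stage-in / compute / stage-out pattern inside a job script. All paths and program names below are illustrative; the sketch defaults to /tmp and a dummy input so it runs outside the cluster, but on an Odyssey compute node you would point SCRATCH_ROOT at /scratch:

```shell
#!/bin/bash
# Sketch of the stage-in / compute / stage-out pattern for node-local scratch.
SCRATCH_ROOT=${SCRATCH_ROOT:-/tmp}                      # /scratch on a compute node
WORKDIR=$SCRATCH_ROOT/${USER:-demo}/job_${SLURM_JOB_ID:-test}
SUBMIT_DIR=${SLURM_SUBMIT_DIR:-$PWD}

mkdir -p "$WORKDIR"
echo "sample input" > "$SUBMIT_DIR/input.dat"           # stand-in for real input data

cp "$SUBMIT_DIR/input.dat" "$WORKDIR/"                  # stage input to fast local disk
cd "$WORKDIR"
tr a-z A-Z < input.dat > output.dat                     # stand-in for the real program
cp output.dat "$SUBMIT_DIR/"                            # move results off local scratch
cd "$SUBMIT_DIR"
rm -rf "$WORKDIR"                                       # clean up node-local scratch
cat output.dat
```

The key point is that the job reads and writes on the node-local disk, and only the staging copies touch shared storage.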
Storage
- 60 Oxford St: initial Lab shares (4TB), legacy equipment
- 1 Summer Street: personal home directories, purchased lab shares, older Lab-owned compute nodes
- Holyoke, MA: global scratch high-performance filesystem, compute nodes > 2012 (33K+ cores)
Topology may affect the efficiency of your work! For best performance, storage needs to be close to the compute nodes.
Slide 23
Storage Utilization
Use the du Unix command to check disk usage, e.g.,
du -h $HOME
...
37G  /n/home06/pkrastev
https://en.wikipedia.org/wiki/du_(unix)
Slide 24
Examples
Serial application

#!/bin/bash
#SBATCH -J lapack_test
#SBATCH -o lapack_test.out
#SBATCH -e lapack_test.err
#SBATCH -p serial_requeue
#SBATCH -t 0-00:30
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem=4000

# Load required modules
source new-modules.sh

# Run program
./lapack_test.x

Slide 25
Examples
Parallel OpenMP (single-node) application

#!/bin/bash
#SBATCH -J omp_dot
#SBATCH -o omp_dot.out
#SBATCH -e omp_dot.err
#SBATCH -p general
#SBATCH -t 0-02:00
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=16000

# Set up environment
source new-modules.sh
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Run program
srun -c $SLURM_CPUS_PER_TASK ./omp_dot.x

Slide 26
Examples
MATLAB Parallel Computing Toolbox (single-node) application

#!/bin/bash
#SBATCH -J parallel_monte_carlo
#SBATCH -o parallel_monte_carlo.out
#SBATCH -e parallel_monte_carlo.err
#SBATCH -N 1
#SBATCH -c 8
#SBATCH -t 0-03:30
#SBATCH -p general
#SBATCH --mem=32000

# Load required software modules
source new-modules.sh
module load matlab/r2016a-fasrc01

# Run program
srun -n 1 -c 8 matlab-default -nosplash -nodesktop -r "parallel_monte_carlo;exit"

Slide 27
Examples
Parallel MPI (multi-node) application

#!/bin/bash
#SBATCH -J planczos
#SBATCH -o planczos.out
#SBATCH -e planczos.err
#SBATCH -p general
#SBATCH -t 30
#SBATCH -n 8
#SBATCH --mem-per-cpu=4000

# Load required modules
source new-modules.sh
module load intel/15.0.0-fasrc01
module load openmpi/1.8.3-fasrc02

# Run program
srun -n 8 --mpi=pmi2 ./planczos.x

https://github.com/fasrc/user_codes
Slide 28
Test first
Before diving into submitting 100s or 1000s of research jobs, ALWAYS test a few first:
- Ensure the job will run to completion without errors.
- Ensure you understand the resource needs and how they scale with different data sizes and input options.
Slide 29
Contact Information
Harvard Research Computing
Website: http://rc.fas.harvard.edu
Email: rchelp@fas.harvard.edu, plamenkrastev@fas.harvard.edu
Office Hours: Wednesdays, noon to 3pm, 38 Oxford Street, 2nd Floor Conference Room
Slide 30