Introduction to High Performance Computing at Case Western Reserve University
Research Computing and CyberInfrastructure (RCCI) Team, KSL Data Center
Presenters: Emily Dragowsky, Daniel Balagué Guardia, Hadrian Djohari, Sanjaya Gajurel
Bootcamp Outline
Who we are
Case HPC resources
Working with the Cluster
Basic Linux
Job Scripting
Open Discussion/Q&A
Who we are
Research Computing and CyberInfrastructure Team (RCCI)
[U]Tech, 5th floor, overlooking Euclid
University staff with academic ties: CWRU grads, research group members, skilled practitioners
Strong collaboration with the Network, Servers, and Storage teams
RCCI Services
Cyberinfrastructure:
High Performance Computing
Research Networking services
Research Storage and Archival solutions
Secure Research Environment for computing on regulated data
Support:
Education and Awareness
Consultation and Pre-Award Support
Database Design
Visualization
Programming Services
Public Cloud and Off-Premise Services:
Concierge for off-premise services (XSEDE, OSC, AWS)
Case HPC Cluster
Designed for computationally intensive jobs: long-running, number-crunching
Optimized for batch jobs: combine resources as needed (CPU, memory, GPU)
Supports interactive/graphically intensive jobs
OS version emphasizes stability: Linux (Red Hat Enterprise Linux 6.8)
Accessible from Linux, Mac, and Windows
Some level of Linux expertise is needed - why we're here today
Clusters: redcat (slurm) and hadoop
HPC Cluster Glossary
Head Nodes: development, analysis, job submission
Compute Nodes: computational computers
Panasas: engineered file system, fastest storage
Dell Fluid File System: value storage
Data Transfer Nodes: hpctransfer, dtn1
Science DMZ: lowest-resistance data pathway
SLURM: cluster workload manager (job scheduler)
HPC Cluster Components (architecture diagram): head nodes (redcat.case.edu), SLURM master and admin nodes as the resource manager, data transfer nodes on the Science DMZ behind the university firewall, Panasas and Dell FFS storage, and batch, GPU, and SMP compute nodes.
Working on the Cluster
How to:
~ access the cluster
~ get my data onto the cluster
~ establish interactive sessions
<break>
~ submit jobs through the scheduler
~ monitor jobs, a.k.a. why is my job not running??
~ work with others within the cluster
You can log in from anywhere
You will need:
An approved cluster account
Your CaseID and Single Sign-On password
An ssh (secure shell) utility [detailed instructions for all platforms]
We recommend the x2go client; PuTTY or Cygwin (Windows) and Terminal (Mac/Linux) will work for non-graphical sessions
If off campus, connect through the VPN using two-factor authentication (Case Guest wireless counts as off-campus)
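For example, a minimal login from a Mac/Linux terminal; abc123 is a hypothetical CaseID, and -X requests X-forwarding for graphical output:
> ssh -X abc123@redcat.case.edu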
HPC Environment: Your Full Cluster Resources
Your HPC account, sponsored by your PI, provides:
Group affiliation: resources shared amongst group members
Storage:
/home: permanent storage, replicated & snapshot-protected
/scratch/pbsjobs: up to 1 TB temporary storage
/scratch/users: small-scale temporary storage
Exceeding quota(s) will prevent use of the account
Cores: member groups receive an allocation of 32+ for an 8-share
Wall-time: 320-hour limit for member shares (32 hours for guest shares)
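A quick sketch for checking how much space you are using (du is a standard Linux tool; the scratch path shown is illustrative):
> du -sh $HOME
> du -sh /scratch/users/<caseid>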
HPC Environment: Your /home
Allocated storage space in the HPC filesystem for your work
Create subdirectories underneath /home/<caseid>; ideally each job has its own subdirectory
cd: Linux command to change the current directory
Examples, each changing to your home directory:
cd /home/<caseid>
cd ~<caseid>
cd $HOME
$HOME is an environment variable that points to /home/<caseid>
You are not alone. > ls /home
HPC Environment: Beyond /home
Linux systems have a hierarchical directory structure
User files: /home
System files: /bin, /dev, /etc, /log, /opt, /var
Application files: /usr/local/<module>/<version>
Consider Python: 4 versions installed
/bin/python: 2.6.6
/usr/local/python/: 2.7.8, 2.7.10, 3.5.2
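With multiple versions installed, a minimal check of which one the shell will actually run (output depends on the modules you have loaded; see the next slides):
> which python
> python --version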
HPC Environment: Environment Variables
Keeping organized:
> echo $PATH
/home/mrd20/bin/grom5/bin:/home/mrd20/bin:/usr/local/i/1.0.0/bin:/usr/local/openmpi/1.8.8/bin:/usr/local/intel/2015/composer_xe_2015.3.187/bin/intel64:/usr/local/munge/bin:/usr/local/slurm/bin:/usr/local/slurm/sbin:/usr/lib64/qt-3.3/bin:/usr/local/emflex/1-j.11/wai/flex/programs:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/dell/srvadmin/bin
> echo $LD_LIBRARY_PATH
/home/mrd20/bin/grom5/lib64:/usr/local/openmpi/1.8.8/lib:/usr/local/intel/2015/composer_xe_2015.3.187/mkl/lib/intel64:/usr/local/intel/2015/composer_xe_2015.3.187/compiler/lib/intel64:/usr/local/munge/lib:/usr/local/slurm/lib:/usr/lib:/usr/lib64:/usr/local/lib
Modules and Environment
Module command: avail, list, load, unload
Manages the environment necessary to run your applications (binaries, libraries, shortcuts)
Using the module commands will set or remove the relevant environment variables:
>>module avail (or module avail python)
>>module list (shows modules loaded in your environment)
>>module load python (loads default version)
>>module load python/3.5.2 (loads specific version)
>>module unload python/3.5.2 (unloads specific version)
Modules and Environment
Module command: list & display
[mrd20@hpc2 ~]$ module list
Currently Loaded Modules:
  1) intel/2015   2) openmpi/1.8.8   3) i/1.0.0   4) StdEnv   5) python/2.7.8
[mrd20@hpc2 ~]$ module display python
----------------------------------------------------------------------------
/usr/local/share/modulefiles/python/2.7.8:
----------------------------------------------------------------------------
whatis("a powerful high-level programming language ")
prepend_path("PATH","/usr/local/python/2.7.8/bin")
prepend_path("CPLUS_INCLUDE_PATH","/usr/local/python/2.7.8/include")
prepend_path("C_INCLUDE_PATH","/usr/local/python/2.7.8/include")
prepend_path("LD_LIBRARY_PATH","/usr/local/python/2.7.8/lib")
prepend_path("LIBRARY_PATH","/usr/local/python/2.7.8/lib")
prepend_path("PKG_CONFIG_PATH","/usr/local/python/2.7.8/lib/pkgconfig")
Data Transfer: the scp command
scp [-12346BCpqrv] [-c cipher] [-F ssh_config] [-i identity_file] [-l limit] [-o ssh_option] [-P port] [-S program] [[user@]host1:]file1 ... [[user@]host2:]file2
Copy from HPC to your local PC:
scp -r mrd20@redcat.case.edu:/home/mrd20/data/vlos.dat .
(the trailing full stop means "this directory")
From your PC to HPC:
scp orange.py mrd20@redcat:
(the colon separates the hostname from the destination path)
Data Transfer: GLOBUS
Setup instructions: https://sites.google.com/a/case.edu/hpc-upgraded-cluster/home/important-notes-for-new-users/transferring-files
Start an Interactive GUI Session
Create a session on a compute node, not on the head node
srun: create a job allocation (if needed) and launch a job step
srun --x11 [-p batch -n 4 -t 1:00:00] --pty /bin/bash
--x11: invokes X-forwarding
--pty: pseudoterminal; type of shell = bash
-p: partition (batch, gpufermi, gpuk40, smp)
-n: number of tasks
-t: duration of resource allocation
Examples: Interactive GUI Session
Accepting the defaults:
srun --x11 --pty /bin/bash
More tasks (default 1 cpu-per-task):
srun --x11 -p batch -n 4 -t 1:00:00 --pty /bin/bash
Graphically intensive session (default duration 10 hours):
srun --x11 -p gpufermi --gres=gpu:2 -n 12 --pty /bin/bash
Now Let's Take Time For: reflection, beverages, stretching the legs, washing of hands, booking a flight, checking email, quiet contemplation, talking with our neighbors
Working Big on the CWRU HPC Cluster
Many people at once; many jobs running, and queued awaiting resources
The Slurm workload manager software has three key functions:
It allocates access to resources (compute nodes) to users for some duration of time so they can perform work
It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes
It arbitrates contention for resources by managing a queue of pending work
Monitor Cluster Status
Workload management for the collective benefit of the HPC community
sinfo: view information about Slurm nodes and partitions
sinfo [flags]
-n: nodes by name
-o: format output, e.g.:
sinfo -o "%10P %.3a %.10l %.4D %.8t %.14C %N"
PARTITION AVA TIMELIMIT NODE STATE CPUS(A/I/O/T) NODELIST
si: a script invoking sinfo with a set of standard flags
Exercise: > less `which si`, and examine the bash script contents
Submit a Job through the Scheduler
Workload management for the collective benefit of the HPC community
sbatch: create a resource allocation request to launch a job step
sbatch [-p batch -N 1 -t 2-1:00:00] script
script: a bash shell script
-p: partition (batch, gpufermi, gpuk40, smp)
-N: nodes
-t: duration of resource allocation [dd-hh:mm:ss]
Other common flags: -A, --ntasks, --cpus-per-task, --mem-per-cpu
Example Job Script: hexacarbonyl-16.slurm
#!/bin/bash
#SBATCH --time=4:00:00
#SBATCH --nodes=2
#SBATCH --ntasks=6
#SBATCH --cpus-per-task=2
#SBATCH --job-name=hexacarbonyl-16_job

# Load the Gaussian module
module load gaussian/16-sse

# Run Gaussian
srun g16 hexacarbonyl-16.com
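Submitting the script above, a minimal sketch (the job ID in the reply is illustrative):
> sbatch hexacarbonyl-16.slurm
Submitted batch job 148137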
Checking Job Status (I)
squeue: view information about jobs in the scheduling queue
squeue [options]
-u <caseid>
-A <PI caseid>
-l: standard long output fields
-o: select fields for output (~90 fields exist)
--start: show estimated start times for pending jobs
Full documentation: slurm.schedmd.com/squeue.html
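For example, listing your own jobs in long format, then the estimated start times of pending ones (mrd20 is the illustrative CaseID used elsewhere in these slides):
> squeue -u mrd20 -l
> squeue -u mrd20 --start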
Checking Job Status (II)
scontrol: view and modify Slurm configuration and state
Most functionality is reserved for system administrators
scontrol [options] [commands]
scontrol show job <jobid>
scontrol show node <nodename> (refer to HPC Resource View)
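A concrete check on a single job, using an illustrative job ID reported by squeue:
> scontrol show job 148137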
Working within Group Allocations
Group Name / ID: tas35 / 10085 (guest)
Resources: CPUs, RAM, max duration: 1-12:00:00
Checking group usage with squeue:
squeue -o "%A %C %e %E %g %l %m %N %T %u" | awk 'NR==1 || /eecs600/'
JOBID CPUS END_TIME DEPENDENCY GROUP TIME_LIMIT MIN_MEMORY NODELIST STATE USER
148137 1 2016-01-26T16:54:22 eecs600 2:00:00 1900 comp145t RUNNING aar93
148146 1 2016-01-27T01:14:27 eecs600 10:00:00 1900 comp148t RUNNING hxs356
SLURM Resources Reading List
Case HPC SLURM command summary
CPU Management User and Administrator Guide: http://slurm.schedmd.com/cpu_management.html
Support for Multi-core/Multi-Thread Architectures: http://slurm.schedmd.com/mc_support.html
Slides from Tutorial for Beginners: http://www.schedmd.com/cray/tutorial.begin.pdf
SLURM manual pages: http://slurm.schedmd.com/<command>.html
Case Cluster: How to Learn
Web search: CWRU HPC
https://sites.google.com/a/case.edu/hpc-upgraded-cluster/home
hpc-support@case.edu
Summary
Head nodes: reserved for organizing work
Compute nodes: meant for performing work
Low-impedance network for large-scale data transfer
SLURM workload manager & scheduler
RCCI staff on hand for aid; jump in and learn: hpc-support@case.edu
RCCI Team: Roger Bielefeld, Mike Warfe, Hadrian Djohari, Daniel Balagué, Brian Christian, Emily Dragowsky, Jeremy Fondran, Sanjaya Gajurel, Matt Garvey, Theresa Griegger, Cindy Martin, Lee Zickel