Research Computing, University of Colorado
The JANUS Computing Environment
Monte Lunacek, monte.lunacek@colorado.edu, rc-help@colorado.edu
What is JANUS? (November 2011)
1,368 compute nodes
16,416 processors (2.8 GHz Intel Westmere)
~20 GB of available space
~800 TB of storage
TFLOPS is a rate of execution: trillions of floating-point operations per second.
NUMA architecture
Resource management and queues
Different architectures
Parallel file systems
Lots of ways to do something...
Explicit environment
Online resources www.rc.colorado.edu
Overview
Access: login, file system, data transfer
Software: supported software, dotkits, building software
Resource Management: queues, Moab, and Torque
Running Jobs: single-core, load-balanced, MPI, OpenMP
Questions
Access
Login Procedure
ssh <username>@login.rc.colorado.edu
Password: one-time passwords via Yubikeys or Cryptocards
RC Filesystem
Home directory: /home/<user_name>, 2 GB, Network File System (NFS)
Project space (build software here): /projects/<user_name>, 250 GB, NFS
Scratch space (run software here): /lustre/janus_scratch/<user_name>, Lustre file system, no quota, no backup
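A quick way to check how much of each space you are using (a sketch; quota and lfs are standard commands, but their exact output and whether quotas are reported depend on how each file system is configured):
quota -s                                    # NFS quotas (home, projects), human-readable
lfs quota -u $USER /lustre/janus_scratch    # Lustre usage; no quota is enforced, but usage is reported
df -h /home/$USER /projects/$USER           # free space on the NFS file systems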
Snapshot
Did you accidentally remove a file or directory?
$HOME/.snapshot/hourly.[0-12]
$HOME/.snapshot/nightly.[0-6]
$HOME/.snapshot/weekly.[0-7]
Example:
rm $HOME/bugreport.csh
cp $HOME/.snapshot/weekly.0/bugreport.csh $HOME
Where?
$HOME/.snapshot
/projects/<user_name>/.snapshot
Lustre
A scalable, POSIX-compliant parallel file system designed for large, distributed-memory systems.
Object Storage Targets (OST): store user file data.
Object Storage Servers (OSS): control I/O access and handle network requests.
Metadata Target (MDT): stores filenames, directories, permissions, and file layout.
Metadata Server (MDS): assigns the storage locations associated with each file in order to direct file I/O requests to the correct set of OSTs.
[Diagram: metadata server (MDS) and metadata target (MDT), connected over InfiniBand (IB) to object storage servers (OSS) and object storage targets (OST)]
File Access
[Diagram: MDS/MDT and OSS/OST connected over IB]
A compute node first requests the storage location from the MDS, then interacts directly with the OSTs.
Striping
A file is a contiguous sequence of bytes. Key feature: the Lustre file system can distribute these segments across multiple OSTs using a technique called file striping. A file is said to be striped when its contiguous sequence of bytes is separated into small chunks, or stripes, so that read and write operations can access multiple OSTs concurrently.
File I/O
Serial: a single process writes one file (/file)
File-per-process: each process writes its own file (/file1, /file2, ..., /filen)
Shared file: all processes write to a single /file
Collective buffering: not currently supported on JANUS
[Plot: single-processor write speed (MB/s) vs. stripe count (1 to 60), for 1 MB and 32 MB transfer sizes]
[Plot: file-per-process write speed (MB/s) vs. number of processors/files (1 to 2048)]
[Plot: shared-file-with-striping write speed (MB/s) vs. number of processors (1 to 1024)]
Examples
bash-janus> mkdir temp_dir
bash-janus> lfs setstripe -c 3 temp_dir
bash-janus> touch temp_dir/temp_file
bash-janus> lfs getstripe temp_dir
temp_dir
stripe_count: 3 stripe_size: 33554432 stripe_offset: -1
temp_dir/temp_file
lmm_stripe_count: 3
lmm_stripe_size: 33554432
lmm_stripe_offset: 18
  obdidx    objid       objid      group
      18    12787913    0xc320c9   0
       7    12863377    0xc44791   0
      23    12496893    0xbeaffd   0
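A related sketch for shared-file output: a stripe count of -1 asks Lustre to stripe across all available OSTs (on the Lustre releases of this era, -s sets the stripe size). The directory names are illustrative, and the best stripe count depends on your I/O pattern, so treat this as a starting point:
lfs setstripe -c -1 shared_output_dir      # stripe new files in this directory over all OSTs
lfs setstripe -c 4 -s 4m small_file_dir    # 4 stripes of 4 MB each for modest shared files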
Data transfer
https://www.rc.colorado.edu/crcdocs/file-transfer
GridFTP: a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks.
Globus Online: large file transfers with drag and drop.
Utilities: archiving tools for moving data between the long-term archival storage and the compute systems.
scp, sftp, rsync: good for small files.
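A minimal sketch of the small-file tools in practice (file names and destination paths are illustrative):
scp results.tar.gz <username>@login.rc.colorado.edu:/lustre/janus_scratch/<username>/
rsync -av --progress data/ <username>@login.rc.colorado.edu:/projects/<username>/data/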
Access tips
Control Sockets: one-time passwords make multiple terminal sessions and file transfers painful. Reuse one authenticated connection:
mkdir -p ~/.ssh/sockets
cat >> ~/.ssh/config << EOF
Host login.rc*
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h:%p
EOF
Mount Drive: http://macfusionapp.org/
Symbolic links: /project, /scratch
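For example (a hedged sketch, not an official recipe: the symlink targets assume the paths from the RC Filesystem slide, and sshfs is only one way to mount a remote drive):
ln -s /projects/$USER ~/project                    # shortcut to project space
ln -s /lustre/janus_scratch/$USER ~/scratch        # shortcut to scratch space
# From a workstation with sshfs installed:
mkdir -p ~/janus_home
sshfs <username>@login.rc.colorado.edu:/home/<username> ~/janus_home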
Software
Software support
Supported software: RC expertise; select state-of-the-art software; installation, verification, and training.
Unsupported software: installation; user expertise.
Consulting: advice on installing your software and any dependencies.
Environment
To run an executable, you need to know where it is.
/opt/openmpi/1.4.4/bin/mpicxx
/opt/mpich2/1.5a2/bin/mpicxx
Which one does the command which mpicxx find? That depends on your PATH.
What about libraries?
/opt/openmpi/1.4.4/lib/libmpi.so
/opt/mpich2/1.5a2/lib/libmpi.so
That depends on your LD_LIBRARY_PATH.
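A quick way to see what your current environment will pick up (a sketch; the OpenMPI paths are just the examples from above, and ./simulator stands in for any MPI executable):
export PATH=/opt/openmpi/1.4.4/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/1.4.4/lib:$LD_LIBRARY_PATH
which mpicxx                     # should now report /opt/openmpi/1.4.4/bin/mpicxx
ldd ./simulator | grep libmpi    # shows which libmpi.so the binary will load at run time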
Dotkit
Manages your environment variables.
use                     list packages in use
use -a                  list hidden packages in use
use <package_name>      add a package to your environment
unuse <package_name>    remove a package from your environment
use -la                 list available packages
use -la <term>          list packages that contain <term>
Examples
bash-janus> use NCAR-Parallel-Intel
bash-janus> echo $PATH
/curc/tools/free/redhat_5_x86_64/parallel-netcdf-1.2.0_openmpi-1.4.5_intel-12.1.4/bin
/curc/tools/free/redhat_5_x86_64/openmpi-1.4.5_intel-12.1.4/bin
/curc/tools/free/redhat_5_x86_64/torque-2.5.8/bin
/curc/tools/free/redhat_5_x86_64/netcdf-4.1.3_intel-12.1.4_hdf-4.2.6_hdf5-1.8.8_openmpi-1.4.5/bin
/curc/tools/free/redhat_5_x86_64/hdf5-1.8.8_openmpi-1.4.5_intel-12.1.4/bin
/curc/tools/nonfree/redhat_5_x86_64/intel-12.1.4/composer_xe_2011_sp1.10.319/bin/intel64
/curc/tools/free/redhat_5_x86_64/sun_jdk-1.6.0_23-x86_64/bin
/curc/tools/free/redhat_5_x86_64/hdf-4.2.6_ics-2012.0.032/bin
/curc/tools/free/redhat_5_x86_64/szip-2.1/bin
/curc/tools/nonfree/redhat_5_x86_64/moab-6.1.5/bin
(directories shown one per line for readability; the shell keeps them colon-separated)
Building Software
I need the Boost C++ library for my software. Where should I build it?
/home/molu8455/projects/software/boost/1.49.0
Build on a compute node (e.g. qsub -I); see the sketch below.
Ideas: consider sharing this with your group. How about your own dotkit?
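A minimal sketch of that workflow (the qsub resource request, the Boost tarball name, and the install prefix are illustrative, not a prescribed recipe):
qsub -I -q janus-debug -l nodes=1:ppn=12,walltime=1:00:00    # build on a compute node, not the login node
cd /projects/<user_name>/software
tar xzf boost_1_49_0.tar.gz && cd boost_1_49_0
./bootstrap.sh --prefix=/projects/<user_name>/software/boost/1.49.0
./b2 install                                                 # older Boost releases use ./bjam instead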
Build your own dotkit
cat $HOME/.kits/TeachingHPC.dk
#c Teaching HPC
#d This contains the libraries I use for teaching HPC:
#d   .openmpi-1.4.3_gcc-4.5.2_torque-2.5.8_ib
#d   .hdf5-1.8.6

# Dependencies
dk_op -q .torque-2.5.8
dk_op -q .openmpi-1.4.3_gcc-4.5.2_torque-2.5.8_ib
dk_op -q .hdf5-1.8.6

# Variables
dk_alter HDF5_DIR /curc/tools/free/redhat_5_x86_64/hdf5-1.8.6
dk_alter BOOST_ROOT /home/molu8455/projects/software/boost/1.49.0
dk_alter LD_LIBRARY_PATH /home/molu8455/projects/software/boost/1.49.0/lib
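To use it (assuming your personal kits directory, e.g. $HOME/.kits, is on dotkit's search path; check whether it appears in use -la):
use TeachingHPC    # pulls in the dependencies and sets HDF5_DIR, BOOST_ROOT, LD_LIBRARY_PATH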
Resource Management
Scheduling
[Diagrams: seven jobs laid out on a nodes-versus-time grid, first in submission order and then repacked by the scheduler to fill idle nodes]
Moab and Torque
Moab: the brains of the operation; comes up with the schedule.
Torque: reports information to Moab; receives direction from Moab; handles user requests; provides job query facilities.
Commands
showq -u <username>            show jobs in the queue
canceljob <job_id> (or ALL)    cancel your job(s)
checkjob <job_id>              information about your job
qsub                           submit jobs
showstart <job_id>             when will your job start?
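A typical sequence using these tools (the job script name and job id are illustrative):
qsub job.pbs          # returns a job id such as 123456
showq -u $USER        # is it queued or running?
checkjob 123456       # what did it request, why is it waiting?
showstart 123456      # the scheduler's estimate of the start time
canceljob 123456      # give up on it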
qsub
Requests a resource for your job: 1) batch or 2) interactive.
Makes environment variables available to your job: PBS_O_* (e.g. PBS_O_WORKDIR), PBS_NODEFILE.
Options:
-q <queue_name>
-l <resource_list>
-I  interactive
-N <name>
-e <error_path>
-o <output_path>
-j <join_path>
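For example (a sketch; the resource requests and script name are arbitrary):
qsub -q janus-debug -l nodes=1:ppn=12,walltime=0:30:00 -N test job.pbs    # batch submission
qsub -I -q janus-debug -l nodes=1:ppn=1,walltime=0:30:00                  # interactive shell on a compute node
# inside a running job, PBS_NODEFILE lists the hosts you were allocated:
sort -u $PBS_NODEFILE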
Queues
Name           Nodes     Max Time    Node Sharing
janus-debug    1-480     1 hour
janus-short    1-480     4 hours
janus-long     1-80      7 days
janus-small    1-20      1 day
janus-normal   21-80     1 day
janus-wide     81-480    1 day
Running Jobs
Process
How many processors do I need? Approximately how long will this take?
showstart can also estimate a start time for a hypothetical request (processors@walltime):
showstart 1024@30:00
showstart 16@16:00:00
Which queue best fits these criteria? (See the queue table above.)
Serial Jobs
#!/bin/bash
#PBS -N example_1
#PBS -q janus-debug
#PBS -l walltime=00:05:00
#PBS -l nodes=1:ppn=1
#PBS -e errfile
#PBS -o outfile
cd $PBS_O_WORKDIR
# run trial 1 of the simulator
./simulator 1 > sim.1
Pack the node
#!/bin/bash
#PBS -N example_2
#PBS -q janus-debug
#PBS -l walltime=0:00:30,nodes=1:ppn=12
cd $PBS_O_WORKDIR
./simulator 1 > sim.1 &
./simulator 2 > sim.2 &
./simulator 3 > sim.3 &
./simulator 4 > sim.4 &
./simulator 5 > sim.5 &
./simulator 6 > sim.6 &
./simulator 7 > sim.7 &
./simulator 8 > sim.8 &
./simulator 9 > sim.9 &
./simulator 10 > sim.10 &
./simulator 11 > sim.11 &
./simulator 12 > sim.12 &
wait
Multi-node serial jobs?
Consider using our load-balancing tool: https://www.rc.colorado.edu/tutorials/loadbalance
#!/bin/bash
#PBS -N example_1
#PBS -q janus-debug
#PBS -l walltime=00:05:00
#PBS -l nodes=2:ppn=12
cd $PBS_O_WORKDIR
. /curc/tools/utils/dkinit
reuse LoadBalance
mpirun load_balance -f cmd_lines

cmd_lines (one command per line):
./simulator 1 > sim.1
./simulator 2 > sim.2
./simulator 3 > sim.3
./simulator 4 > sim.4
./simulator 5 > sim.5
./simulator 6 > sim.6
./simulator 7 > sim.7
./simulator 8 > sim.8
./simulator 9 > sim.9
./simulator 10 > sim.10
...
./simulator 2000 > sim.2000
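One way to generate such a command file (a sketch, assuming the simulator takes the trial number as its only argument):
for i in $(seq 1 2000); do
    echo "./simulator $i > sim.$i"
done > cmd_lines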
MPI
#!/bin/bash
#PBS -N example_4
#PBS -q janus-debug
#PBS -l walltime=0:10:00
#PBS -l nodes=3:ppn=12
cd $PBS_O_WORKDIR
. /curc/tools/utils/dkinit
reuse .openmpi-1.4.5_intel-12.1.4
# run trial 1 of the simulator
mpirun -np 36 ./simulator
# or omit -np and let mpirun use every slot in the allocation
mpirun ./simulator
Non-Uniform Memory Access (NUMA)
Each socket has a dedicated memory area for high-speed access, and an interconnect to the other sockets for slower access to their memory.
[Diagram: two sockets, each with its own memory controller and local memory, linked by an interconnect]
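If numactl is available on the compute nodes (an assumption; it is a standard Linux tool but may not be installed everywhere), you can inspect and control placement:
numactl --hardware                                   # show sockets, their local memory, and inter-node distances
numactl --cpunodebind=0 --membind=0 ./simulator      # pin a run to socket 0 and its local memory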
MPI + OpenMP / High Memory
#!/bin/bash
#PBS -N example_5
#PBS -q janus-debug
#PBS -l walltime=0:10:00
#PBS -l nodes=3:ppn=12
cd $PBS_O_WORKDIR
. /curc/tools/utils/dkinit
reuse .openmpi-1.4.5_intel-12.1.4

# one MPI process per node, 12 OpenMP threads each
export OMP_NUM_THREADS=12
mpirun --bind-to-core --bynode --npernode 1 ./simulator

# one MPI process per socket, 6 OpenMP threads each
export OMP_NUM_THREADS=6
mpirun --bind-to-socket --bysocket --npersocket 1 ./simulator
Summary
Access: use control sockets for login.
Filesystem: build software in /projects/<user_name>; run your jobs in /lustre/janus_scratch/<user_name>; recover files with .snapshot; consider striping when using shared-file access.
Data Transfer: large files with Globus Online or GridFTP; smaller files with sftp, scp.
Software: build on a compute node; manage your environment with your own dotkits.
Resource Management: familiarize yourself with the queues; when you have choices, use showstart.
Running Jobs: request what you need and manage with LoadBalance; for OpenMP, be aware of NUMA; limit the number of processes per node for hybrid and high-memory jobs.
Questions?
Collective buffering
At large core counts, I/O performance can be hindered by:
MDS contention (file-per-process)
file system contention (shared-file)
Collective buffering uses a subset of application processes to perform I/O:
limits the number of files (file-per-process)
limits the number of processes accessing file system resources (shared-file)
offloads work from the file system to the application
a subset of processors write, reducing contention
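Where collective buffering is supported (not on JANUS at the time of this slide), it is typically enabled through MPI-IO hints; with ROMIO-based MPI implementations one common mechanism is a hints file named by the ROMIO_HINTS environment variable. A sketch under that assumption, with an arbitrary aggregator count:
cat > romio_hints << EOF
romio_cb_write enable
cb_nodes 4
EOF
export ROMIO_HINTS=$PWD/romio_hints    # read by the MPI-IO layer at runtime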