June 19, 2014 Bart Oldeman bart.oldeman@mcgill.ca McGill University / Calcul Québec / Compute Canada Montréal, QC Canada
Outline
  Compute Canada News
  Upcoming Maintenance Downtime in August
  Storage System News
  Scheduler Updates and Demonstration
  Software and User Environment Updates
  Training News
  New Visualization and Collaboration Environment
Compute Canada News
  Compute Canada SPARC (Sustainable Planning for Advanced Research Computing)
    Consultation process with the research community to build a national plan for advanced computing, data storage and archiving requirements
    Targeted for CFI's planned renewal of Compute Canada infrastructure, as well as funding for domain-specific data projects
    Consultations (white papers, workshops) in summer to prepare a preliminary plan for November 2014
    Renewal plan due April 2015
    Notices of Intent for domain proposals due Jan. 2015
  More info: www.computecanada.ca
Maintenance Downtime
  Guillimin maintenance downtime: August 4-7
    Maintenance outage of the data centre cooling distribution system
    Will require stoppage of all logins, data access and batch job activities
    Further information regarding the planned downtime will be distributed by mid-July
Storage System News
  GPFS Stability Issues - Update
    Regular occurrence of GPFS instability on nodes due to node expels
    Typical impact: interruption or halt of writing from jobs
    Investigation with the GPFS and IB support team ongoing with critical priority
  Latest Actions (June 11)
    2nd update to the IB network tunings on all nodes
    Additional increase in receive queue size for IP-over-IB communications across the much larger scale IB fabric
    Continue to observe a significant decrease in the number of node expels (~1-2 every few days - major stability improvement)
    Additional investigations to further improve performance
Storage System News
  Reminder: Upcoming Activities
    Online expansion of /gs to full target size (~2.9 PB)
    Tape Archive (Backup) and Hierarchical Storage Management (HSM) integration
      Migration of scratch policy to use HSM rules for identification and cleanup (In Progress)
      Analyzing characteristics of file system contents to identify suitable HSM migration policies (In Progress)
      Access to tape for targeted backups (In Progress)
Scheduler Update
  In general, improved overall stability and performance
    A few outstanding issues under review with Adaptive Computing
    Testing in development environment with update to Torque 4.2.8 in progress
  Recall: April 10 - qsub for job submission enabled
    Default PATH settings updated to include Torque commands (qsub, qstat, ...)
    Much faster response for submissions and queries compared to Moab commands (msub, canceljob, ...)
    qsub submission filter: qsub -A <RAPid> required for proper accounting and priority assignment (will be relaxed later)
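As a minimal sketch of the qsub workflow recalled above, the job script below supplies the RAP ID through a #PBS -A directive instead of the command line. The RAP ID xyz-123-aa, the job name and the resource request are placeholders chosen for illustration, not values from this presentation.

```shell
#!/bin/bash
#PBS -A xyz-123-aa                        # hypothetical RAP ID; replace with your own
#PBS -N hello_guillimin                   # hypothetical job name
#PBS -l nodes=1:ppn=1,walltime=00:10:00   # illustrative request: 1 core for 10 minutes

# Torque sets PBS_O_WORKDIR to the directory qsub was run from;
# fall back to the current directory when the script runs outside the scheduler.
cd "${PBS_O_WORKDIR:-$PWD}"

# PBS_JOBID is set by Torque inside a job; "interactive" is used otherwise.
job_message="Job ${PBS_JOBID:-interactive} running on $(uname -n)"
echo "$job_message"
```

Saved under a name such as hello.sh (hypothetical), the script would be submitted with `qsub hello.sh`, or with `qsub -A xyz-123-aa hello.sh` if the account is given on the command line instead; `qstat -u $USER` then lists your jobs.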
Scheduler Update
  Job submission documentation updated
    www.hpc.mcgill.ca > Documentation > Submitting Your Job
  With the migration to CentOS 6, nodes are set to the new scheduler
    In the default queue, the chosen node depends on the pmem (memory) PBS parameter or node feature (e.g. m256g, m512g, ...)
    Internal routing for short jobs in default queue
Scheduler Update
  Default Queue - Serial Jobs (nodes=1:ppn=n, n<12) (new: sw2)
  Default Queue - Parallel Jobs (new: higher walltime boundary)
    () default if procs > 12 or nodes > 1 (which need to communicate over IB)
    () default if procs = 12 or nodes=1:ppn=12
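The thresholds on this slide can be sketched as a small shell function. This only illustrates the stated conditions and is not the scheduler's actual routing logic; the labels "serial", "full-node" and "parallel-ib" are descriptive placeholders, not queue names, and the function name is hypothetical.

```shell
#!/bin/bash
# Classify a request using only the thresholds stated on the slide:
#   nodes > 1 or procs > 12   -> parallel job that communicates over IB
#   procs = 12 on one node    -> one full node (nodes=1:ppn=12)
#   nodes=1:ppn=n with n < 12 -> serial job
# Labels are placeholders for illustration, not real queue names.
classify_request() {
    local nodes=$1 ppn=$2
    local procs=$(( nodes * ppn ))
    if [ "$nodes" -gt 1 ] || [ "$procs" -gt 12 ]; then
        echo "parallel-ib"
    elif [ "$procs" -eq 12 ]; then
        echo "full-node"
    else
        echo "serial"
    fi
}

classify_request 1 4     # nodes=1:ppn=4
classify_request 1 12    # nodes=1:ppn=12
classify_request 2 12    # nodes=2:ppn=12
```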
Scheduler Update
  Extra large memory nodes (XLM2)
    Alternative to ScaleMP (offline, to be reimaged to CentOS 6 next week)
    Some nodes reserved by CFI grant holders, others get 12 hours only
  Example PBS submission lines:
    #PBS -l nodes=1:ppn=16,pmem=11700m,walltime=10:00:00      (any XLM2 node)
    #PBS -l nodes=1:ppn=16,pmem=11700m,walltime=1:00:00:00    (non-cfi nodes only)
    #PBS -l nodes=1:ppn=1,pmem=31700m,walltime=10:00:00       (serial on m512g/m1024g)
    #PBS -l nodes=1:ppn=16,pmem=31700m,walltime=10:00:00      (16 cores on m512g/m1024g)
    #PBS -l nodes=1:ppn=16,pmem=31700m,walltime=1:00:00:00    (16 cores on m512g: non-cfi)
    #PBS -l nodes=4:ppn=16,pmem=31700m,walltime=1:00:00:00    (all cores on m512g nodes)
    #PBS -l nodes=1:ppn=32:m1024g,pmem=31700m                 (specific node type, IF you are the CFI holder)
    #PBS -l nodes=1:ppn=16:m256g,pmem=15700m                  (specific node type)
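To see why these pmem values steer a job toward a given node type, it can help to multiply pmem by ppn. The sketch below does that arithmetic for the ppn/pmem pairs used in the example lines; it assumes one process per requested core, and the function name is hypothetical.

```shell
#!/bin/bash
# Memory requested on one node = ppn x pmem, in MB (the "m" suffix on pmem).
# Assumption: one process per requested core, so pmem applies to each core.
node_mem_request_mb() {
    local ppn=$1 pmem_mb=$2
    echo $(( ppn * pmem_mb ))
}

node_mem_request_mb 16 11700   # ppn=16, pmem=11700m -> 187200
node_mem_request_mb 16 15700   # ppn=16, pmem=15700m -> 251200
node_mem_request_mb 16 31700   # ppn=16, pmem=31700m -> 507200
node_mem_request_mb 32 31700   # ppn=32, pmem=31700m -> 1014400
```

If the node features m256g, m512g and m1024g are read as nominal memory sizes in GB (an assumption here), then 251200 MB fits under 256 GB, 507200 MB under 512 GB and 1014400 MB under 1024 GB, each leaving some room for the operating system.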
Scheduler Update
  Examples: why is my job not running yet?
    checkjob -v JOB_ID    (can also use -v -v, etc.)
    showq
    showq -i -v
    showq -r -v
    showq -w class=<queue_name>
    showq -w class=hb
    showq -w class=hbplus
    showq -w class=hb -r
    showq -w class=hbplus -r
    showq -w class=hb -i
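One way to keep these commands together is a small helper that prints the sequence for a given job and queue. This is a sketch only: the function name and the job ID 1234567 are hypothetical, and nothing is executed by the function itself. In Moab, showq -i lists idle (eligible) jobs and showq -r lists running jobs.

```shell
#!/bin/bash
# Print the diagnostic commands for one job, using the commands listed above.
# Nothing is run here; pipe the output to bash on a login node to execute them.
job_diagnostics() {
    local job_id=$1 queue=${2:-}
    echo "checkjob -v $job_id"
    echo "showq -i -v"
    echo "showq -r -v"
    if [ -n "$queue" ]; then
        # Restrict the idle and running views to one queue (class).
        echo "showq -w class=$queue -i"
        echo "showq -w class=$queue -r"
    fi
}

job_diagnostics 1234567 hb   # hypothetical job ID, queue "hb" from the examples
```

Running `job_diagnostics 1234567 hb | bash` on a system where the Moab commands are installed would execute the five commands in order.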
Software Update
  New Installations
    petsc/3.4.4-openmpi-1.6.3-{gcc,intel}
    h5py for python/2.7.3
  GPU Updates
    Driver update completed: NVIDIA-Linux-x86_64-331.67
    Update to /etc/bashrc on GPU nodes to allow for correct operation of the NVIDIA Profiler
  MDCS and Matlab Update
    April 22 - license manager migrated to CentOS 6
    Now supports up to 2014a
    Includes update to standard Matlab license for McGill users (access restricted due to Mathworks license requirements)
Software Update
  Compiler Updates / Additions to come
    Intel 14.0.2
    License manager migration required to support newer Intel installations
    Long-term: project to standardize modules across Calcul Québec
  Others in progress
    MIO2/1.0 modular I/O library from IBM Research
    IOBUFF from Calcul Québec
Training News
  See Training at www.hpc.mcgill.ca for our full calendar of training and workshops planned for 2014, and to register
  Upcoming:
    July 10 - MapReduce and Hadoop for Big Data
    August 17 - Scientific Visualization Tools
  Recently Completed:
    June 5 - Advanced OpenMP
    May 22 - Introduction to the Xeon Phi
New Visualization & Collaboration Environment
  Located at the McGill HPC Centre at ETS (Peel and Notre-Dame O.)
    Polycom Group 700 HD series multi-point conferencing unit
    Two 55-inch LED LCD screens and one 65-inch LED LCD screen
    Crestron AirMedia for wireless connectivity
    Room capacity of 10-15 people
  Room is available for video-conferencing or data visualization
  Contact us at guillimin@calculquebec.ca to access this resource
Other Developments
  Work has started on the summer upgrade of the network link from the data centre at ETS to the McGill core network
    Upgrade from 10 to 40 Gbps
    Will include a 10 Gbps connection to the Calcul Québec router
    Upgrade will enable support for projects requiring additional dedicated network bandwidth in/out of the data centre
    Testing to be completed in July, with production by end of August
User Feedback and Discussion
  Questions? Comments? We value your feedback.
  Guillimin Operational News for Users
  Follow us on Twitter: http://twitter.com/mcgillhpc