Guillimin HPC Users Meeting. Bryan Caron

July 17, 2014 Bryan Caron bryan.caron@mcgill.ca McGill University / Calcul Québec / Compute Canada Montréal, QC Canada

Outline
- Compute Canada News
- Upcoming Maintenance Downtime in August
- Storage System News
- Scheduler Updates
- Software and User Environment Updates
- Training News

Compute Canada News
Reminder: Compute Canada SPARC (Sustainable Planning for Advanced Research Computing)
- Consultation process with the research community to build a national plan for advanced computing, data storage and archiving requirements
- Targeted for CFI's planned renewal of Compute Canada infrastructure as well as funding for domain-specific data projects
- analysis of big data
- software with large memory footprint
- specialized hardware (e.g. GPU accelerators, ...)
- analysis of sensitive private data
- single systems with very large numbers of cores
- dedicated software platforms, gateways, and cloud VMs
- White papers due from research communities by July 31, 2014
- More info: www.computecanada.ca

Maintenance Downtime
Guillimin Maintenance Downtime: August 11-15
- Maintenance outage for the data centre cooling distribution system and electrical supply
- Additional maintenance actions will be performed:
  - System updates (InfiniBand drivers and tunings)
  - Scheduler updates (patch version update)
  - GPFS storage updates (patch versions and tunings)
  - Network configuration updates (40 GigE link into full production)
- Requires stoppage of all logins, data access and batch job activities starting at 8:00 am on Monday, August 11, with services restored on Friday, August 15
- Status of system: www.hpc.mcgill.ca and Twitter
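For planning jobs around the outage, the arithmetic can be sketched as below. This is only an illustration: the slides do not describe how the scheduler treats jobs whose walltime would cross the start of the downtime, and the function names are hypothetical. Times are treated as local (Montréal) time without timezone handling.

```python
from datetime import datetime, timedelta

# Start of the outage as announced: 8:00 am on Monday, August 11, 2014 (local time).
DOWNTIME_START = datetime(2014, 8, 11, 8, 0)


def finishes_before_downtime(start, walltime_hours):
    """Return True if a job starting at `start` ends no later than the outage start."""
    return start + timedelta(hours=walltime_hours) <= DOWNTIME_START


def max_walltime_hours(start):
    """Return the longest walltime (in hours) that still ends before the outage."""
    remaining = DOWNTIME_START - start
    return max(0.0, remaining.total_seconds() / 3600)


if __name__ == "__main__":
    submit_time = datetime(2014, 8, 10, 20, 0)  # hypothetical job start time
    print(finishes_before_downtime(submit_time, 10))
    print(max_walltime_hours(submit_time))
```

A job requesting more walltime than `max_walltime_hours` returns could not complete before logins and batch activity stop.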

Storage System News
GPFS Stability Issues - Update
- Regular occurrence of GPFS instability on nodes due to node expels
- Typical impact: interruption or halt of writing from jobs
- Recall: Latest Actions (June 11): 2nd update to all node IB network tunings
  - significant improvements in stability since that time, with fewer than ~1-2 expels every few days
- Additional GPFS version updates and tunings to be applied during the August 11-15 downtime
  - will provide additional performance and stability at the communication layers used by GPFS
Separate Incident: IB Core Switch Module Failure (July 1-4)
- core switch module failure resulted in communication issues affecting GPFS stability until July 4
- module replaced on July 9, with affected nodes returned to full service
- additional IB module to be replaced in the August downtime as preventative maintenance

Storage System News
Reminder: Upcoming Activities
- Online expansion of /gs to full target size (~2.9 PB) to be performed during the August 11-15 maintenance downtime
  - prerequisite of storage system scrub completed
- Tape Archive (Backup) and Hierarchical Storage Management (HSM) Integration
  - All home directories backed up nightly via TSM - started June 26
  - Access to tape for targeted backups for projects (In Test)
  - Migration of scratch policy to use HSM rules for identification and cleanup (In Progress)
  - Analyzing characteristics of file system contents to identify suitable HSM migration policies (In Progress)

Scheduler Update
- In general, improved overall stability and performance
- A few outstanding issues under review with Adaptive Computing
- Testing of the update to Torque 4.2.8 in the development environment is in progress
- Full update to Torque 4.2.8 planned for the August 11-15 maintenance window
- Recall: April 10 - qsub for job submission enabled
  - Default PATH settings updated to include Torque commands (qsub, qstat, ...)
  - Much faster response for submissions and queries compared to Moab commands (msub, canceljob, ...)
  - qsub submission filter: qsub -A <RAPid> required for proper accounting and priority assignment (will be relaxed later)
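As an illustration of the accounting requirement above, the sketch below assembles a qsub command line that always carries the -A option. The helper name, the RAPid value and the job script name are hypothetical; -A (account string) and -l (resource list) are standard Torque qsub options, but the exact checks applied by the Guillimin submission filter are not described in these slides.

```python
import shlex


def build_qsub_command(rap_id, script, resources=None):
    """Return the argv list for a qsub call that includes the mandatory -A <RAPid>."""
    if not rap_id:
        # Mirrors the idea of the submission filter: no RAPid, no submission.
        raise ValueError("a RAPid is required (qsub -A <RAPid>)")
    argv = ["qsub", "-A", rap_id]
    if resources:
        # Torque accepts a comma-separated resource list after -l.
        argv += ["-l", ",".join(f"{key}={value}" for key, value in resources.items())]
    argv.append(script)
    return argv


if __name__ == "__main__":
    # "abc-123-aa" and "job.sh" are placeholder values, not real identifiers.
    cmd = build_qsub_command(
        "abc-123-aa", "job.sh", {"nodes": "1:ppn=4", "walltime": "01:00:00"}
    )
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` on a login node; building it as a list avoids shell quoting problems.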

Scheduler Update
- Job submission documentation updated: www.hpc.mcgill.ca > Documentation > Submitting Your Job
- With the migration to CentOS 6, nodes are assigned to the new scheduler
- In the default queue, the chosen node depends on the pmem (memory) PBS parameter or the node feature (e.g. m256g, m512g, ...)
- Internal routing for short jobs in the default queue
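A minimal sketch of the two ways to steer node selection mentioned above: a per-process memory request (pmem) or a node feature such as m256g. The helper name and the example pmem value are hypothetical, and the mapping from pmem values to node types is not given in these slides; the syntax follows the usual Torque resource-list form.

```python
def node_request(ppn, pmem=None, feature=None, nodes=1):
    """Build a Torque -l resource string using pmem and/or a node feature."""
    spec = f"nodes={nodes}:ppn={ppn}"
    if feature:
        # Node features are appended to the nodes specification, e.g. nodes=1:ppn=12:m256g.
        spec += f":{feature}"
    parts = [spec]
    if pmem:
        # pmem is a separate resource in the comma-separated list.
        parts.append(f"pmem={pmem}")
    return ",".join(parts)


if __name__ == "__main__":
    print(node_request(12, feature="m256g"))   # select nodes by feature
    print(node_request(4, pmem="5700m"))       # hypothetical memory-per-process value
```

Each string would be passed to qsub after -l, or placed on a `#PBS -l` line in the job script.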

Scheduler Update
- Default Queue - Serial Jobs (nodes=1:ppn=n, n<12) (new: sw2)
- Default Queue - Parallel Jobs (new: higher walltime boundary)
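The serial/parallel distinction above (serial means nodes=1:ppn=n with n<12) can be sketched as a small classifier. The function names are hypothetical, and treating a missing ppn as 1 is an assumption based on Torque's usual default; the actual routing rules of the default queue are not reproduced here.

```python
import re


def parse_nodes_spec(spec):
    """Extract (nodes, ppn) from a string such as 'nodes=1:ppn=4' or 'nodes=2:ppn=12:m256g'."""
    match = re.fullmatch(r"nodes=(\d+)(?::ppn=(\d+))?(?::.*)?", spec)
    if match is None:
        raise ValueError(f"unrecognized nodes specification: {spec!r}")
    nodes = int(match.group(1))
    # Assumption: ppn defaults to 1 when it is not given.
    ppn = int(match.group(2)) if match.group(2) else 1
    return nodes, ppn


def job_class(spec):
    """Classify a request as 'serial' (nodes=1 and ppn<12) or 'parallel' otherwise."""
    nodes, ppn = parse_nodes_spec(spec)
    return "serial" if nodes == 1 and ppn < 12 else "parallel"


if __name__ == "__main__":
    for spec in ("nodes=1:ppn=4", "nodes=1:ppn=12", "nodes=4:ppn=12"):
        print(spec, job_class(spec))
```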

Scheduler Update
Extra large memory nodes (XLM2)
- Alternative to ScaleMP (offline, to be reimaged to CentOS 6)
- Some nodes reserved by CFI grant holders, others get 12 hours only

Scheduler Update
- Prototype of a new version of the cluster utilization graphs: http://tinyurl.com/guillimin-graph-prototype
- Please direct comments and/or suggestions to: guillimin@calculquebec.ca

Software Update
New Installations
- NCVIEW/2.1.2-gcc
- espresso/5.1 (Quantum Espresso)
- cmake/2.8.12.2

Training News
- See Training at www.hpc.mcgill.ca for our full calendar of training and workshops for 2014 and to register
  - all materials from previous workshops are available online
- Upcoming:
  - August 19 - Scientific Visualization Tools (originally scheduled for August 17)
- Recently Completed:
  - July 10 - MapReduce and Hadoop for Big Data
  - June 5 - Advanced OpenMP
  - May 22 - Introduction to the Xeon Phi

User Feedback and Discussion
- Questions? Comments? We value your feedback.
- Guillimin Operational News for Users: follow us on Twitter at http://twitter.com/mcgillhpc