Guillimin HPC Users Meeting
February 11, 2016
guillimin@calculquebec.ca
McGill University / Calcul Québec / Compute Canada
Montréal, QC, Canada
Outline
- Compute Canada News
- Scheduler Updates
- Software Updates
- Training News
- Special Topic: How cold is your data? Cold data and Storage Management on Guillimin
Compute Canada News
RAC and RPP 2016 competition results:
- The new allocation settings for compute and storage were implemented in January
- New: most 2015 CPU allocations are renewed under the same RAPI (xyz-123-ab) for 2016
- Storage quotas for very large allocations are being increased progressively
- We have already started to decrease some quotas for groups whose allocation was reduced from 2015 to 2016, or that have no specific storage allocation for 2016
System Status: IB Network and GPFS Issues
- January 28: InfiniBand hardware issue in one switch
  - Cascade of GPFS communication issues
  - Many failed jobs; we temporarily paused the scheduler
- February 4: InfiniBand network structure (fabric) issue
  - GPFS partitions lost on all nodes
  - A complete restart of the IB network and GPFS was needed: we temporarily paused the scheduler and drained all compute and login nodes
- February 9: all VMs and license servers back online
Scheduler Update
- The scheduler is mostly stable; it only occasionally hangs or crashes
- Monitoring scripts are in place to restart the server if it becomes unresponsive: please wait 10 minutes if qsub, qstat, etc., fail
- Other monitoring scripts automatically take nodes offline when issues are detected, including a check every time a new job is about to start: if the node has issues (GPFS, full disk, RAM, etc.), the job will run elsewhere
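The "wait 10 minutes and retry" advice can be scripted. The sketch below is a hypothetical helper, not an official tool: it wraps a submission command and retries after a delay if the scheduler is momentarily unresponsive (the command is a parameter so the logic can be tested without a real scheduler).

```python
import subprocess
import time

def submit_with_retry(script, attempts=3, wait=600, qsub_cmd=("qsub",)):
    """Submit a job script, retrying if the scheduler is momentarily
    unresponsive while the monitoring scripts restart it.  The default
    10-minute wait matches the advice above; this helper is only a
    sketch, and qsub_cmd is parameterized for illustration/testing."""
    for i in range(attempts):
        result = subprocess.run(list(qsub_cmd) + [script],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout.strip()   # the job ID printed by qsub
        if i < attempts - 1:
            time.sleep(wait)               # scheduler may be restarting
    raise RuntimeError(f"submission failed after {attempts} attempts: "
                       f"{result.stderr.strip()}")
```

On Guillimin this would be called as `submit_with_retry("job.pbs")`; any `time.sleep` shorter than 10 minutes risks retrying before the monitoring scripts have finished restarting the server.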
Software Update
New installations:
- PGI/15.10 (via Lmod only)
- GAMESS-US/20141205-R1 (via Lmod only)
New Lmod/EasyBuild-based module structure:
- Now backwards compatible; opt in by doing: touch ~/.lmod_legacy
- Becomes the default on March 15; opt out via ~/.lmod_disabled
- Old modulefiles keep working, including those in $HOME/modulefiles
- Most new modulefiles are accessed via: module load iomkl/2015b (loads GCC 4.9.3 + Intel 15.0.3 + OpenMPI 1.8.8 + MKL)
- See http://www.hpc.mcgill.ca/index.php/starthere/81-doc-pages/88-guillimin-modules
Training News
- See Training and Outreach at www.hpc.mcgill.ca for our 2016 calendar of training and workshops, and to register
- All materials from previous workshops are available online; see also https://wiki.calculquebec.ca/w/formations/en
- Suggestions for training? Please let us know!
Upcoming (calculquebec.eventbrite.ca):
- February 15: Introduction to Linux and the CIP (U. Montréal)
- February 17: Introduction to Python (McGill U.)
- February 22: Introduction to MPI (U. Montréal)
Recently completed:
- February 3: Introduction to Linux (McGill U.)
- January 28: Introduction to ARC (McGill U.)
User Feedback and Discussion
Questions? Comments? We value your feedback. Contact us at: guillimin@calculquebec.ca
Guillimin operational news for users, status pages:
- http://www.hpc.mcgill.ca/index.php/guillimin-status
- http://serveurscq.computecanada.ca (all CQ systems)
Follow us on Twitter: http://twitter.com/mcgillhpc
How cold is your data? Cold data and Storage Management on Guillimin
February 11, 2016
Outline:
- What is cold data?
- Why is this an issue?
- How can we deal with it?
What is cold data?
- Cold data: any file not accessed for a long time
- Access: any read or write operation
- On Guillimin: we scan the project spaces on all file systems to identify the time of last access and the size of every file
- Unfortunately, no, we cannot cool Guillimin with cold data (but that would be great, especially in summer!)
- So, yes, we have to cool the disks containing cold data
- But, no, cooling disks does not cause cold data ;-)
[Figures: distribution of cold data across the file systems, in TB]
Cold Data: Problem and Solution
Why is this an issue? RAC requests far exceed capacity:
- /sb: capacity = 456 TB; allocations + home directories = 546 TB (120%)
- /gs: capacity = 3097 TB; allocations + scratch directories = 4207 TB (135%)
- Not all data has the same temperature!
How can we deal with it?
- Move files between different tiers (classes) of storage so as to provide disk space for active data
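The overcommitment percentages follow directly from the capacity and allocation numbers above; a quick check (the slide rounds the exact ratios to 120% and 135%):

```python
# Overcommitment of each file system, from the numbers quoted above (TB).
capacity  = {"/sb": 456, "/gs": 3097}   # usable file-system capacity
requested = {"/sb": 546, "/gs": 4207}   # allocations + home/scratch dirs

overcommit = {fs: requested[fs] / capacity[fs] for fs in capacity}
for fs, ratio in overcommit.items():
    print(f"{fs}: {ratio:.1%} of capacity requested")
```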
HSM: Hierarchical Storage Management
What it does: while keeping a single view of the file system (GPFS), it physically moves data across multiple tiers:
- SSDs (not on Guillimin)
- Disks (/sb, /gs)
- Tapes
Stub files (depending on the implemented policies):
- The first 1 MB of data is kept on the first tier (disk)
- The remaining data is moved to the next tier (tape)
- Metadata remain accessible from the first tier (disk)
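The stub-file behaviour can be pictured with a toy model (purely illustrative Python, not the actual HSM implementation): reads that stay inside the 1 MB stub are served from disk, while any read past the stub blocks until the rest of the file is recalled from tape.

```python
STUB_SIZE = 1 << 20  # the first 1 MB of each migrated file stays on disk

def read_stubbed(offset, length, on_disk, recall_from_tape):
    """Toy model of a stub file.  on_disk holds the 1 MB stub;
    recall_from_tape is a callable standing in for the (slow, blocking)
    tape recall that returns the complete file contents."""
    if offset + length <= STUB_SIZE:
        return on_disk[offset:offset + length]   # fast path: disk only
    full = recall_from_tape()                    # blocking tape recall
    return full[offset:offset + length]
```

This is why commands that only touch metadata or file headers stay fast on migrated files, while anything that reads deeper into the file hangs until the recall completes.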
What does it mean?
From an administration and operations perspective:
- Project spaces are scanned to identify the size and time of last access of all files
- Initial selection criteria:
  - Access time and modification time older than 1 year
  - Only files larger than 10 MB
- Selected data blocks are moved to the tape system: from {/gs, /sb} to a hard-drive buffer, then to tape
From a user's perspective:
- File read and write access is blocking: the terminal or any job process hangs temporarily
- During that time: the data recall is queued, the robot puts the tape in a tape drive, the drive seeks to the data, and the data is moved back to disk
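The initial selection criteria can be sketched as a simple scan (an illustration of the policy in plain Python, not the actual GPFS policy engine used on Guillimin):

```python
import os
import time

YEAR = 365 * 24 * 3600          # one year, in seconds
MIN_SIZE = 10 * 1024 * 1024     # 10 MB

def migration_candidates(root, now=None):
    """Walk a project space and yield files matching the initial
    selection criteria above: access time AND modification time older
    than one year, and size larger than 10 MB."""
    now = now if now is not None else time.time()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            if (now - st.st_atime > YEAR and
                    now - st.st_mtime > YEAR and
                    st.st_size > MIN_SIZE):
                yield path
```

Note that both timestamps must be old: a large file that is still being read regularly is warm data and stays on disk.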
Tests that have been done
First test: simultaneously recall five 4 GB files
a) The first read action (diff) waited 2 to 3 minutes
b) Reading all five files (start time) with Python code:
- File 1: delay 0 s (already on GPFS from test a)
- File 2: delay 42 s
- File 3: delay 52 s
- File 4: delay 15 s
- File 5: delay 60 s
(recall sequence: files 1, 4, 2, 3, 5)
Second test: simultaneously recall eight 4 GB files, with Python code (test started at second 0):
- File 6: read from second 122 to 128
- File 1: from second 155 to 161
- File 2: from second 166 to 172
- File 8: from second 185 to 191
- File 7: from second 205 to 211
- File 5: from second 219 to 225
- File 4: from second 241 to 247
- File 3: from second 262 to 268
Good news: no timeout error messages!
Third test: simultaneously recall eight 4 GB files, with C code (test started at second 0):
- File 1: second 31 to 33
- File 7: second 44 to 45
- File 3: second 65 to 67
- File 5: second 73 to 74
- File 8: second 83 to 86
- File 6: second 91 to 92
- File 4: second 114 to 115
- File 2: second 132 to 134
C vs Python: when all files are already on disk, the C code reads them in 17 seconds and the Python code in 18 seconds
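The actual test scripts were not published; a minimal sketch of this kind of timing harness, which reads each file and records when its read started and finished relative to the start of the test, might look like:

```python
import time

def timed_reads(paths, chunk=1 << 20):
    """Read each file to the end and record (start, end) times in
    seconds relative to the start of the test.  When a file has been
    migrated, the read blocks inside the loop until the tape recall
    brings the data back to disk."""
    t0 = time.monotonic()
    results = {}
    for path in paths:
        start = time.monotonic() - t0
        with open(path, "rb") as f:
            while f.read(chunk):       # blocks here during a recall
                pass
        results[path] = (start, time.monotonic() - t0)
    return results
```

Touching all the files first (e.g. with `head -c1` on each) queues the recalls at once, which is what makes the per-file intervals in the tables above overlap-free but staggered.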
Current status of deployment
Status:
- All project spaces have been scanned for candidates
- A few selected project spaces have been partly migrated
- We are still fine-tuning and further testing the migration system
A new prquota is in development:
- It will report data usage both on disk and on tape
- The total permitted usage will be equal to the allocation size
- The disk quota will be set dynamically according to the amount of data on tape
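If the total permitted usage equals the allocation, the dynamic disk quota reduces to simple arithmetic. The helper below is a hypothetical illustration of that rule, not the actual prquota implementation:

```python
def disk_quota(allocation_tb, tape_usage_tb):
    """Dynamic disk quota as described above: disk + tape together may
    not exceed the allocation, so the disk quota shrinks as data is
    migrated to tape.  (Illustrative sketch, not the real tool.)"""
    return max(allocation_tb - tape_usage_tb, 0)

# e.g. a 50 TB allocation with 12 TB already on tape leaves 38 TB on disk
print(disk_quota(50, 12))
```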
Conclusion
For any other questions: guillimin@calculquebec.ca