High Performance Computing Center HPCC User Group Meeting Planned Update of Quanah and Hrothgar Eric Rees Research Associate - HPCC June 6, 2018
HPCC User Group Meeting Agenda (6/6/2018)
- State of the Clusters
- Updating the Clusters
- Shutdown Schedule
- Getting Help
- Q&A
HPCC User Group Meeting State of the Clusters
Updating the Clusters

Quanah Cluster
- Commissioned in 2017; updated in Q4 2017
- Consists of 467 nodes, 16,812 cores, 87.56 TB total RAM
- Xeon E5-2695v4 Broadwell processors
- Omni-Path (100 Gbps) fabric

Hrothgar Cluster
- Commissioned in 2011; updated in 2014; downgraded in Q4 2017
- Consists of 455 nodes (33 nodes in CC), 6,528 cores (660 in CC)
- A mixture of Xeon X5660 Westmere processors and Xeon E5-2670 Ivy Bridge processors
- DDR & QDR (20 Gbps) InfiniBand fabric
Cluster Utilization (Primary Queues Only)
Cluster Updates

Previous Cluster Updates
- Q2 2017: Commissioned Quanah for general use
- Q4 2017: Upgraded Quanah / downgraded Hrothgar; switched to OpenHPC/Warewulf for node provisioning on Hrothgar; isolated the storage network; invested significant time into stabilizing the Omni-Path network (Quanah)
- Q1 2018: Invested significant time into stabilizing the InfiniBand network (Hrothgar); invested in a high-speed (10 Gbps) line to the Chemistry server room

Future Cluster Updates
- Early Q3 2018 (July): Update the scheduler to UGE 8.5.5; update the scheduler's policies; unify the Quanah, Hrothgar, and CC environments (OS, installed software, containers); stabilize Hrothgar nodes by updating to CentOS 7; increase the size of Hrothgar's serial queue by adding 239 nodes (2,868 cores) to it; update the Software Installation Policy
- Late Q3 2018: Replace the Hrothgar UPS (tentative)
- Early-to-mid Q4 2018: Commission the new HPCC generator
HPCC User Group Meeting Updating the Clusters
Updating the Clusters

Three major changes will take place during the July shutdown:
1. Update the scheduler
   - Update the version to 8.5.5
   - Switch to a share-tree based policy
2. Update all nodes to CentOS 7.4
   - Bring all clusters into the same environment
3. Update the Software Installation Policy
HPCC User Group Meeting Updating the Scheduler
Updating the Scheduler

The following changes will be made to the omni queue:
- Updating to a new version of the Univa scheduler (8.5.5)
  - The fill parallel environment (-pe fill) will be removed
- Switching to a share-tree based policy
- Adding two new projects (xlquanah and hep)
- Implementing new features (JSV & RQS)
- Implementing memory constraints
Updating the Scheduler

- We were originally on a share-tree based policy.
- HPCC encountered a new bug in the UGE scheduler: our systems couldn't share resources fairly due to the coding error, and Univa was unable to determine the source of the problem.
- Univa's resolution: perform a fresh re-install of the UGE scheduler.

How does UGE assign priority?

prio = weight_priority * pprio
     + weight_urgency * normalized(hrr + weight_waitingtime * waiting_time_in_seconds + weight_deadline / time_remaining_in_seconds)
     + weight_ticket * normalized(ftckt + otckt + stckt)
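The priority formula above can be sketched as a small calculation. The weights and inputs below are illustrative placeholders, not the cluster's actual configuration; UGE normalizes the urgency and ticket contributions into a common range before weighting them.

```python
# Sketch of the UGE job-priority formula. Weight values here are
# illustrative only, not HPCC's real scheduler configuration.

def normalized(value, min_v, max_v):
    """Scale a raw urgency or ticket value into [0, 1] before weighting."""
    if max_v == min_v:
        return 0.0
    return (value - min_v) / (max_v - min_v)

def job_priority(pprio, urgency_norm, tickets_norm,
                 w_priority=1.0, w_urgency=0.1, w_ticket=0.01):
    """prio = w_priority*pprio + w_urgency*norm(urgency) + w_ticket*norm(tickets)."""
    return w_priority * pprio + w_urgency * urgency_norm + w_ticket * tickets_norm

# Example: a job with POSIX priority 0, mid-range urgency,
# and a high share-tree ticket count.
print(job_priority(pprio=0.0, urgency_norm=0.5, tickets_norm=0.9))
```

Under the new share-tree policy, the ticket term (dominated by stckt, the share-tree tickets) is what usage history feeds into.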
Switching to a share-tree based policy

The updated share-tree policy will cause the following:
- Max priority value: 2.0001
- Cluster usage will be the primary contributor to the number of share tickets you receive.
  - Your share is based on the number of core hours you have used, are currently using, and are projected to use based on currently waiting jobs.
  - Your share (as a value) is halved every 7 days: run a 10 core-hour job today, and after 7 days it counts as 5 core hours, after 14 days as 2.5.
- Waiting time, deadline time, and slot urgency will weigh in very little.
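The 7-day halving described above is an exponential decay with a half-life of 7 days. A minimal sketch (the half-life is the only figure taken from the slide; the function name is ours):

```python
def decayed_usage(core_hours, days_ago, half_life_days=7.0):
    """Past usage counts for half as much every half-life (7 days)."""
    return core_hours * 0.5 ** (days_ago / half_life_days)

# A 10 core-hour job counts as 5 after one week and 2.5 after two.
print(decayed_usage(10, 7))   # 5.0
print(decayed_usage(10, 14))  # 2.5
```

The practical effect: a burst of heavy usage suppresses your priority sharply at first, then fades from the scheduler's memory over a few weeks.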
Switching to a share-tree based policy

Expected outcomes
- Usage will now play heavily into a user's share of the cluster:
  - Heavy users will take a hit to their job priority and run less often.
  - Moderate users will likely get a boost to their job priorities and run more often.
  - Minor/rare users will likely not notice any differences.
- Example: User A submits two 3,600-core jobs and User B submits two hundred 36-core jobs. A will get to run 1 job, then B will get to run 100 jobs before A is granted their second job.
Additional Projects

The omni queue will now contain 3 projects:
- quanah (default): able to use all 16,812 cores, with a maximum 48-hour run time, and able to use the sm and mpi parallel environments.
- xlquanah: able to use 144 cores, with a maximum 120-hour run time, and only able to use the sm parallel environment.
- hep: able to use 720 cores with no other restrictions. Only available to the High Energy Physics user group.
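The three project limits can be summarized as a small table of data. The structure and the checking function below are a hypothetical sketch for reasoning about the limits, not an HPCC tool; the core, hour, and parallel-environment figures come from the slide.

```python
# Hypothetical summary of the omni-queue project limits described above.
# (hep has no stated run-time cap or PE restriction, hence None / both PEs.)
PROJECTS = {
    "quanah":   {"max_cores": 16812, "max_hours": 48,   "pes": {"sm", "mpi"}},
    "xlquanah": {"max_cores": 144,   "max_hours": 120,  "pes": {"sm"}},
    "hep":      {"max_cores": 720,   "max_hours": None, "pes": {"sm", "mpi"}},
}

def request_ok(project, cores, hours, pe):
    """Check a (cores, hours, parallel environment) request against a project's limits."""
    limits = PROJECTS[project]
    if cores > limits["max_cores"]:
        return False
    if limits["max_hours"] is not None and hours > limits["max_hours"]:
        return False
    return pe in limits["pes"]

print(request_ok("quanah", 3600, 48, "mpi"))   # True
print(request_ok("xlquanah", 144, 121, "sm"))  # False (over the 120-hour cap)
```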
Implementing New Features

The omni queue will now make use of the following:
- Resource Quota Sets (RQS)
  - The RQS will act as an enforcer of some job specifications; primarily, the RQS will help enforce run-time limits on jobs.
- Job Submission Verifier (JSV)
  - This script will run after you submit a qsub but before your job is officially submitted.
  - The JSV will either accept your job, make any necessary changes to your job and then accept it, or reject your job.
Job Submission Verifier

JSV for the omni queue. The JSV will do the following:
- Ensure only jobs that could possibly run actually get accepted (accept, correct, or reject).
- Ensure time limits, memory, and other constraints are enforced.
- Try to prevent the most common reasons for jobs to get stuck in qw:
  - Requesting >36 cores for -pe sm, or a non-multiple of 36 for -pe mpi
- Enforce resource time requests:
  - 48 hours on quanah
  - 120 hours on xlquanah
Job Submission Verifier

JSV for the omni queue (continued). Enforce memory constraints:
- If not defined by the user, calculate the memory request (# of requested slots * ~5.3 GB) and set it in the job.
- If the requested memory exceeds the total memory of the requested nodes, reject the job.
- If the job is single-node (sm), the maximum requested memory should not exceed 192 GB.
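The JSV rules above can be sketched as plain validation logic. The figures (36 cores per node, ~5.3 GB per slot, 192 GB per node) come from the slides; the function itself is a simplified, hypothetical stand-in for the real JSV script, which also handles run-time limits and project checks.

```python
# Simplified sketch of the omni-queue JSV memory/slot checks (not the real script).
GB_PER_SLOT = 5.3      # default memory per requested slot (from the slides)
CORES_PER_NODE = 36    # Quanah nodes have 36 cores each
NODE_MEM_GB = 192      # memory available on a single node

def jsv_check(pe, slots, h_vmem_gb=None):
    """Return (verdict, h_vmem_gb): 'reject', or 'accept' with a possibly corrected request."""
    # Core-count rules that commonly leave jobs stuck in qw.
    if pe == "sm" and slots > CORES_PER_NODE:
        return ("reject", h_vmem_gb)
    if pe == "mpi" and slots % CORES_PER_NODE != 0:
        return ("reject", h_vmem_gb)
    # Correct step: set a default memory request if the user gave none.
    if h_vmem_gb is None:
        h_vmem_gb = round(slots * GB_PER_SLOT, 1)
    # Reject requests exceeding the total memory of the requested nodes.
    nodes = max(1, slots // CORES_PER_NODE)
    if h_vmem_gb > nodes * NODE_MEM_GB:
        return ("reject", h_vmem_gb)
    # Single-node (sm) jobs cannot exceed one node's 192 GB.
    if pe == "sm" and h_vmem_gb > NODE_MEM_GB:
        return ("reject", h_vmem_gb)
    return ("accept", h_vmem_gb)

print(jsv_check("sm", 36))    # accepted, with a default request of ~190.8 GB set
print(jsv_check("mpi", 40))   # rejected: 40 is not a multiple of 36
```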
Implementing Memory Constraints

All jobs will now have a set amount of memory they are limited to.
- Users can define their memory resources using -l h_vmem=<int>[m|g]
- Example for requesting 6 GB of memory: -l h_vmem=6g

How do memory constraints work?
- The omni queue will make use of soft memory constraints: any job that goes over its requested maximum memory will be killed only if there is memory contention.
- Example 1: Your job requested 10 GB and uses 11 GB. No other user is on the system, so your job continues.
- Example 2: Your job requested 10 GB and uses 11 GB. All remaining memory has been given to other jobs on the same node (memory contention), so your job is killed.
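The soft-constraint behavior in the two examples reduces to a simple rule: exceeding your request is only fatal when other jobs need the memory. A hypothetical illustration (function name and return values are ours):

```python
def enforce_soft_limit(used_gb, requested_gb, node_has_contention):
    """Soft memory limit: kill a job over its request only under memory contention."""
    if used_gb > requested_gb and node_has_contention:
        return "killed"
    return "running"

# Example 1: over the request, but the node has memory to spare.
print(enforce_soft_limit(11, 10, node_has_contention=False))  # running
# Example 2: over the request while other jobs hold the rest of the node's memory.
print(enforce_soft_limit(11, 10, node_has_contention=True))   # killed
```

The design rationale: a hard limit would kill any job the moment it exceeded its request, even on an otherwise idle node; the soft limit only intervenes when the overage actually harms other users.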
HPCC User Group Meeting Updating the Environment
Updating the Environment

The following changes will be made to all nodes:
- The operating system will be brought up to CentOS 7.4 (Quanah is currently CentOS 7.3; Hrothgar is currently CentOS 6.9).
- All HPCC-installed software will now be converted to RPMs and installed locally on the nodes (currently all software is installed from the NFS servers).
- Modules will now be the same for all clusters.
- Singularity containers will be available for use on all clusters.
Updating the Environment

Why change the environment?
- Stability
  - OpenHPC does not support CentOS 6.9; as such, Hrothgar has been less stable than Quanah. This update will improve the stability of Hrothgar.
  - Migrating to a common OS will make switching between clusters easier and reduce the number of compilations users may need to perform.
- Security
  - Some Spectre and Meltdown protections will be implemented by moving to CentOS 7.4.
Updating the Environment

Why change the environment? (continued)
- Speed
  - Migrating to a common OS allows us to install software directly to the nodes in an automated fashion.
  - Running applications locally will improve the runtime of many commonly used applications.
Updating the Environment

What does this mean for me?
- Software may require re-compilation. This will primarily apply to Hrothgar; however, some Quanah applications may need it as well.
- You will be able to run containers on Hrothgar just as you do on Quanah.
- Module paths will be the same on Quanah and Hrothgar.
  - This will make switching between the clusters easier.
  - This will resolve the problems with module spider you see on Hrothgar.
- You will be able to submit jobs to West or Ivy from any node; you are no longer restricted to the Hrothgar nodes for West and the Ivy nodes for Ivy.
HPCC User Group Meeting July Shutdown Schedule
July Shutdown

Shutdown Timeline
- July 6, 5:00 PM: Disable and begin draining all queues
- July 9: Shut down all clusters; upgrade the OS; install the new scheduler; test the new scheduler and OS updates
- July 10: Continue testing and debugging any issues
- July 11: Complete testing; 5:00 PM: reopen clusters for general use
Quanah, Hrothgar and Shutdown Q&A Session

When will the HPCC fully shut down?
- Approximately 8:00 AM on July 9, 2018.

Will the community clusters change?
- Yes: your systems will be upgraded from CentOS 6.9 to CentOS 7.4, and software may require recompilation.
- Otherwise, no: job submissions, controlled access, and current resources will stay the same.

Will my data be affected?
- No, we do not anticipate any unintended alteration or loss of data stored on our storage servers.
Quanah, Hrothgar and Shutdown Q&A Session

When will there be another purge of /lustre/scratch?
- The next purge will occur before the shutdown.
- An email explaining which files will be deleted will go out soon.
- If you have data you can't afford to lose, back it up!

Can I buy in to Quanah like Hrothgar?
- Yes. For efficiency, we are moving to a "class of service" model rather than isolated dedicated resources for the community cluster shared service.
- Buying in gives you priority access to your portion of the Quanah cluster; unused compute cycles will be shared among other users.
- You still have ownership of specific machines for capital expenditure and inventory purposes.
Quanah, Hrothgar and Shutdown Q&A Session

Can I purchase storage space?
- Yes, we are considering several options for storage based on relative needs for speed, reliability, and potential optional backup.
- We are developing a Research Data Management System (RDMS) service in conjunction with the TTU Library.
- For more information, please contact Dr. Alan Sill (alan.sill@ttu.edu) or Dr. Eric Rees (eric.rees@ttu.edu).
Quanah, Hrothgar and Shutdown Q&A Session

Can I access my data during the shutdown?
- Yes, the Globus Connect endpoint (Terra) will remain online during the shutdown.

Can I move my data before the shutdown?
- Yes, please use Globus Connect. See the HPCC User Guides or visit: http://tinyurl.com/hpcc-data-transfer

Any questions before we continue?
HPCC User Group Meeting User Engagement
User Engagement

HPCC Seminars
- In the process of planning these out; monthly or bi-monthly during the long semesters.
- Meetings will include:
  - Presentations from TTU researchers on research performed (in part) using HPCC resources.
  - News regarding the HPCC: planned shutdowns and planned changes.

HPCC User Training
- Restarting the HPCC user training courses.
- General/new user training courses will be offered during the 2nd and 4th weeks of each long semester and the 1st and 2nd weeks of each summer semester.
- Training courses on topics of interest or advanced topics will occur as interest or need arises.
- Surveys regarding user needs and current or future HPCC services.
HPCC User Training Topics

Next HPCC User Training
- Date: June 12th
- Topic: New User Training

Future Training Topics
- General/new user training
- Training on topics of interest:
  - Advanced job scheduling
  - Singularity containers (building and using)
HPCC User Group Meeting Getting Help
Getting Help

Best ways to get help:
- Visit our website: hpcc.ttu.edu (most user guides have been updated; new user guides are being added)
- Submit a support ticket: send an email to hpccsupport@ttu.edu
July Shutdown

Shutdown Timeline
- July 6, 5:00 PM: Disable and begin draining all queues
- July 9: Shut down all clusters; upgrade the OS; install the new scheduler; test the new scheduler and OS updates
- July 10: Continue testing and debugging any issues
- July 11: Complete testing; 5:00 PM: reopen clusters for general use

What should you do to prepare?
- Ensure any jobs you wish to run are queued well before the shutdown.
- Test your Hrothgar applications using the serial queue (which has been converted to CentOS 7).
  - Instructions will be available on our website soon: hpcc.ttu.edu
  - Qlogin to the Hrothgar serial queue and ensure any required modules are still available.

Questions? Comments? Concerns?
Email me at eric.rees@ttu.edu or send an email to hpccsupport@ttu.edu