HPCC User Group Meeting


High Performance Computing Center
HPCC User Group Meeting: Planned Update of Quanah and Hrothgar
Eric Rees, Research Associate - HPCC
June 6, 2018

HPCC User Group Meeting Agenda (6/6/2018)
- State of the Clusters
- Updating the Clusters
- Shutdown Schedule
- Getting Help
- Q&A

HPCC User Group Meeting State of the Clusters

Updating the Clusters

Quanah Cluster
- Commissioned in 2017; updated in Q4 2017
- 467 nodes, 16,812 cores, 87.56 TB total RAM
- Xeon E5-2695v4 Broadwell processors
- Omni-Path (100 Gbps) fabric

Hrothgar Cluster
- Commissioned in 2011; updated in 2014; downgraded in Q4 2017
- 455 nodes (33 nodes in CC), 6,528 cores (660 in CC)
- A mixture of Xeon X5660 Westmere and Xeon E5-2670 Ivy Bridge processors
- DDR & QDR (20 Gbps) InfiniBand fabric

Cluster Utilization (Primary Queues Only)

Cluster Updates

Previous Cluster Updates
- Q2 2017: Commissioned Quanah for general use
- Q4 2017: Upgraded Quanah / downgraded Hrothgar; switched to OpenHPC/Warewulf for node provisioning on Hrothgar; isolated the storage network; invested significant time in stabilizing the Omni-Path network (Quanah)
- Q1 2018: Invested significant time in stabilizing the InfiniBand network (Hrothgar); invested in a high-speed (10 Gb) line to the Chemistry server room

Future Cluster Updates
- Early Q3 2018 (July): Update the scheduler to UGE 8.5.5; update the scheduler's policies; unify the Quanah, Hrothgar, and CC environments (OS, installed software, containers); stabilize Hrothgar nodes by updating to CentOS 7; increase the size of Hrothgar's serial queue by adding 239 nodes (2,868 cores); update the Software Installation Policy
- Late Q3 2018: Replace the Hrothgar UPS (tentative)
- Early-to-mid Q4 2018: Commission the new HPCC generator

HPCC User Group Meeting Updating the Clusters

Updating the Clusters

Three major changes will take place during the July shutdown:
1. Update the scheduler: move to version 8.5.5 and switch to a share-tree based policy
2. Update all nodes to CentOS 7.4 and bring all clusters into the same environment
3. Update the Software Installation Policy

HPCC User Group Meeting Updating the Scheduler

Updating the Scheduler

The following changes will be made to the omni queue:
- Updating to a new version of the Univa scheduler (8.5.5)
- Removing the fill parallel environment (-pe fill)
- Switching to a share-tree based policy
- Adding two new projects (xlquanah and hep)
- Implementing new features (JSV & RQS)
- Implementing memory constraints

Updating the Scheduler

- We were originally on a share-based policy.
- HPCC encountered a new bug in the UGE scheduler: due to the coding error, our systems could not share resources fairly.
- Univa was unable to determine the source of the problem.
- Univa's resolution: perform a fresh re-install of the UGE scheduler.

How does UGE assign priority?

prio = weight_priority * pprio
     + weight_urgency * normalized(hrr + weight_waitingtime * waiting_time_in_seconds
                                       + weight_deadline / time_remaining_in_seconds)
     + weight_ticket * normalized(ftckt + otckt + stckt)
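
The relative weights in this formula come from the scheduler configuration. As a hedged illustration (standard Grid Engine commands; Quanah's actual values are not shown here), they can be inspected with qconf:

    qconf -ssconf    # show the scheduler configuration, including the weight_* values used in the priority formula
    qconf -sstree    # show the currently configured share tree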

Switching to a share-tree based policy

The updated share-tree policy will cause the following:
- Max priority value: 2.0001
- Cluster usage will be the primary contributor to the number of share tickets you receive
- Your share is based on the number of core hours you have used, are currently using, and are projected to use based on currently waiting jobs
- Your share (as a value) is halved every 7 days: a 10 core-hour job run today counts as 5 after 7 days and 2.5 after 14 days
- Waiting time, deadline time, and slot urgency will weigh in very little

Switching to a share-tree based policy

Expected outcomes: usage will now play heavily into a user's share of the cluster.
- Heavy users will take a hit to their job priority and run less often
- Moderate users will likely get a boost to their job priorities and run more often
- Minor/rare users will likely not notice any difference

Example: User A submits two 3,600-core jobs and User B submits two hundred 36-core jobs. A will get to run one job, then B will get to run 100 jobs before A is granted their second job.
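
To see how these policies affect your own jobs, the standard Grid Engine reporting switches expose the priority and ticket values directly; a minimal hedged example (replace <username> with your own account):

    qstat -u <username> -pri    # per-job priority components (urgency, POSIX priority, tickets)
    qstat -u <username> -ext    # extended view, including functional (ftckt) and share-tree (stckt) tickets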

Additional Projects

The omni queue will now contain 3 projects:
- quanah (default): may use all 16,812 cores, with a maximum run time of 48 hours, and may use the sm and mpi parallel environments
- xlquanah: may use 144 cores, with a maximum run time of 120 hours, and may only use the sm parallel environment
- hep: may use 720 cores with no other restrictions; only available to the High Energy Physics user group

See the submission example below for selecting a non-default project.
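
As a hedged illustration of selecting a non-default project at submission time (the script name and 96-hour runtime are hypothetical), the standard qsub -P switch is used:

    qsub -P xlquanah -pe sm 1 -l h_rt=96:00:00 my_long_job.sh    # 96 h exceeds quanah's 48 h cap, so the job runs under xlquanah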

Implementing New Features

The omni queue will now make use of the following:
- Resource Quota Sets (RQS): the RQS acts as an enforcer of some job specifications, primarily helping to enforce run-time limits on jobs (see the sketch below).
- Job Submission Verifier (JSV): this script runs after you submit a job with qsub but before the job is officially accepted. The JSV will either accept your job, make any necessary changes and then accept it, or reject it.
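
For context, a Grid Engine resource quota set is defined as a small rule block. The sketch below is illustrative only, not the HPCC's actual configuration (rule name, queue name, and limit are assumptions; RQS rules most commonly cap consumables such as slots):

    {
       name         omni_user_slots
       description  "illustrative cap on slots per user in the omni queue"
       enabled      TRUE
       limit        users {*} queues omni to slots=3600
    }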

Job Submission Verifier

JSV for the omni queue. The JSV will:
- Ensure that only jobs that could possibly run actually get accepted (accept, correct, or reject)
- Ensure time limits, memory requests, and other constraints are enforced
- Try to prevent the most common reasons for jobs to get stuck in the qw (waiting) state, such as requesting more than 36 cores for -pe sm or a non-multiple of 36 for -pe mpi (examples below)
- Enforce run-time requests: 48 hours on quanah, 120 hours on xlquanah
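
To make the parallel-environment rules concrete, a few hedged submission examples (script names are placeholders):

    qsub -pe sm 36 smp_job.sh      # fine: sm stays within one 36-core node
    qsub -pe sm 72 smp_job.sh      # flagged by the JSV: more than 36 cores requested for sm
    qsub -pe mpi 144 mpi_job.sh    # fine: mpi request is a multiple of 36
    qsub -pe mpi 100 mpi_job.sh    # flagged by the JSV: 100 is not a multiple of 36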

Job Submission Verifier

JSV for the omni queue (continued). The JSV will enforce memory constraints:
- If h_vmem is not defined by the user, the JSV calculates (# of requested slots * ~5.3 GB) and sets it on the job (see the sketch below)
- If the requested memory is greater than the total memory of the requested nodes, the job is rejected
- If the job is serial (sm), the maximum requested memory must not exceed 192 GB
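
As a rough sketch of how such a memory default could be applied, not the HPCC's actual JSV, the script below assumes the standard UGE bash JSV helper functions shipped in $SGE_ROOT/util/resources/jsv/jsv_include.sh and a hypothetical ~5.3 GB-per-slot default:

    #!/bin/bash
    # Illustrative JSV sketch: default h_vmem to ~5.3 GB per requested slot.

    . "${SGE_ROOT}/util/resources/jsv/jsv_include.sh"

    jsv_on_start()
    {
       return
    }

    jsv_on_verify()
    {
       # Slots requested through a parallel environment (empty for serial jobs).
       slots=$(jsv_get_param pe_max)
       [ -z "$slots" ] && slots=1

       # If no h_vmem was requested, assign roughly 5.3 GB per slot
       # (the per-core share of a 192 GB, 36-core node).
       vmem=$(jsv_sub_get_param l_hard h_vmem)
       if [ -z "$vmem" ]; then
          jsv_sub_add_param l_hard h_vmem "$((slots * 5430))M"
          jsv_correct "h_vmem defaulted to ~5.3 GB per requested slot"
          return
       fi

       jsv_accept "Job accepted"
    }

    jsv_main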

Implementing Memory Constraints

Memory constraints
- All jobs will now have a set amount of memory they are limited to
- Users can define their memory request using -l h_vmem=<int>[M|G]
- Example for requesting 6 GB of memory: -l h_vmem=6G (see the job-script sketch below)

How do memory constraints work?
- The omni queue will make use of soft memory constraints: a job that goes over its requested maximum memory is killed only if there is memory contention
- Example 1: Your job requested 10 GB and uses 11 GB. No other user is on the system, so your job continues
- Example 2: Your job requested 10 GB and uses 11 GB. All remaining memory has been given to other jobs on the same node (memory contention), so your job is killed
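
Putting the memory and core requests together, a submission script might look like the hedged sketch below (job name, queue flag, and executable are placeholders; adjust to your own workflow):

    #!/bin/bash
    #$ -V -cwd
    #$ -N example_job       # hypothetical job name
    #$ -q omni              # the omni queue
    #$ -pe sm 36            # 36 cores on one node; mpi requests should be multiples of 36
    #$ -l h_vmem=6G         # memory request, as in the 6 GB example above
    #$ -l h_rt=48:00:00     # run-time request, within quanah's 48-hour limit

    ./example_program       # hypothetical executable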

HPCC User Group Meeting Updating the Environment

Updating the Environment

The following changes will be made to all nodes:
- The operating system will be brought up to CentOS 7.4 (Quanah is currently on CentOS 7.3, Hrothgar on CentOS 6.9)
- All HPCC-installed software will be converted to RPMs and installed locally on the nodes (currently all software is installed from the NFS servers)
- Modules will now be the same for all clusters
- Singularity containers will be available for use on all clusters

Updating the Environment

Why change the environment?
- Stability: OpenHPC does not support CentOS 6.9, so Hrothgar has been less stable than Quanah; this update will improve the stability of Hrothgar. Migrating to a common OS will also make switching between clusters easier and reduce the number of compilations users may need to perform.
- Security: some Spectre and Meltdown protections will be implemented by moving to CentOS 7.4.

Updating the Environment

Why change the environment?
- Speed: migrating to a common OS allows us to install software directly on the nodes in an automated fashion, and running applications locally will improve the runtime of many commonly used applications.

Updating the Environment

What does this mean for me?
- Software may require recompilation; this will primarily apply to Hrothgar, though some Quanah applications may need it as well
- You will be able to run containers on Hrothgar just as you do on Quanah
- Module paths will be the same on Quanah and Hrothgar, which will make switching between the clusters easier and will resolve the module spider problems you see on Hrothgar (see the example commands below)
- You will be able to submit jobs to West or Ivy from any node; you are no longer restricted to Hrothgar for West and Ivy for Ivy
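
As a small illustration of the unified environment (the module and image names below are examples only, not a list of installed software):

    module spider gromacs                      # search the module tree; behaves the same on Quanah and Hrothgar
    module load gromacs                        # identical module names on both clusters after the update
    singularity exec my_image.simg ./my_app    # containers run on Hrothgar just as on Quanah; image name is hypothetical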

HPCC User Group Meeting July Shutdown Schedule

July Shutdown

Shutdown Timeline
- July 6: 5:00 PM - Disable and begin draining all queues
- July 9: Shut down all clusters; upgrade the OS; install the new scheduler
- July 10: Test the new scheduler and OS updates; continue testing and debugging any issues
- July 11: Complete testing; 5:00 PM - Reopen clusters for general use

Quanah, Hrothgar and Shutdown Q&A Session

When will the HPCC fully shut down?
- Approximately 8:00 AM on July 9, 2018

Will the community clusters change?
- Yes: your systems will be upgraded from CentOS 6.9 to CentOS 7.4, and software may require recompilation
- No: job submissions, controlled access, and current resources will stay the same

Will my data be affected?
- No, we do not anticipate any unintended alteration or loss of data stored on our storage servers

Quanah, Hrothgar and Shutdown Q&A Session

When will there be another purge of /lustre/scratch?
- The next purge will occur before the shutdown
- An email explaining which files will be deleted will go out soon
- If you have data you can't afford to lose, back it up!

Can I buy in to Quanah like Hrothgar?
- Yes. For efficiency, we are moving to a class-of-service model rather than isolated dedicated resources for the "community cluster" shared service
- Buying in gives you priority access to your portion of the Quanah cluster
- Unused compute cycles will be shared among other users
- You still have ownership of specific machines for capital expenditure and inventory purposes

Quanah, Hrothgar and Shutdown Q&A Session

Can I purchase storage space?
- Yes, we are considering several options for storage based on relative needs for speed, reliability, and potential optional backup
- We are developing a Research Data Management System (RDMS) service in conjunction with the TTU Library
- For more information, please contact Dr. Alan Sill (alan.sill@ttu.edu) or Dr. Eric Rees (eric.rees@ttu.edu)

Quanah, Hrothgar and Shutdown Q&A Session

Can I access my data during the shutdown?
- Yes, the Globus Connect endpoint (Terra) will remain online during the shutdown

Can I move my data before the shutdown?
- Yes, please use Globus Connect; see the HPCC User Guides or visit http://tinyurl.com/hpcc-data-transfer

Any questions before we continue?
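
For command-line transfers, one hedged option is the Globus CLI (endpoint UUIDs and paths below are placeholders; the web interface described in the user guides works just as well):

    globus login                                                          # authenticate the CLI
    globus endpoint search "Terra"                                        # look up the HPCC endpoint's UUID
    globus transfer <SRC_UUID>:/my/data <DST_UUID>:/backup --recursive    # copy a directory; UUIDs and paths are placeholders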

HPCC User Group Meeting User Engagement

User Engagement

HPCC Seminars
- In the process of planning these out; monthly or bi-monthly during the long semesters
- Meetings will include presentations from TTU researchers on research performed (in part) using HPCC resources, and news regarding the HPCC (planned shutdowns, planned changes)

HPCC User Training
- Restarting the HPCC user training courses
- General / new user training courses will be offered during the 2nd and 4th week of each long semester and the 1st and 2nd week of each summer semester
- Training courses on topics of interest or advanced topics will occur as interest or need arises
- Surveys regarding user needs and current or future HPCC services

HPCC User Training Topics

Next HPCC User Training
- Date: June 12th
- Topic: New User Training

Future Training Topics
- General / new user training
- Training on topics of interest
- Advanced job scheduling
- Singularity containers (building and using)

HPCC User Group Meeting Getting Help

Getting Help

Best ways to get help
- Visit our website, hpcc.ttu.edu (most user guides have been updated and new user guides are being added)
- Submit a support ticket: send an email to hpccsupport@ttu.edu

July Shutdown

Shutdown Timeline
- July 6: 5:00 PM - Disable and begin draining all queues
- July 9: Shut down all clusters; upgrade the OS; install the new scheduler
- July 10: Test the new scheduler and OS updates; continue testing and debugging any issues
- July 11: Complete testing; 5:00 PM - Reopen clusters for general use

What should you do to prepare?
- Ensure any jobs you wish to run are queued well before the shutdown
- Test your Hrothgar applications using the serial queue (which has already been converted to CentOS 7); instructions will be available on our website soon (hpcc.ttu.edu)
- Qlogin to the Hrothgar serial queue and ensure any required modules are still available

Questions? Comments? Concerns?
Email me at eric.rees@ttu.edu or send an email to hpccsupport@ttu.edu