Guillimin HPC Users Meeting. Bryan Caron

July 17, 2014
Bryan Caron (bryan.caron@mcgill.ca)
McGill University / Calcul Québec / Compute Canada, Montréal, QC, Canada

Outline
- Compute Canada News
- Upcoming Maintenance Downtime in August
- Storage System News
- Scheduler Updates
- Software and User Environment Updates
- Training News

Compute Canada News
Reminder: Compute Canada SPARC (Sustainable Planning for Advanced Research Computing) is a consultation process with the research community to build a national plan for advanced computing, data storage, and archiving requirements, targeted for CFI's planned renewal of Compute Canada infrastructure as well as funding for domain-specific data projects:
- analysis of big data
- software with large memory footprints
- specialized hardware (e.g. GPU accelerators, ...)
- analysis of sensitive private data
- single systems with very large numbers of cores
- dedicated software platforms, gateways, and cloud VMs
White papers are due from research communities by July 31, 2014. More info: www.computecanada.ca

Maintenance Downtime
Guillimin Maintenance Downtime: August 11-15
- Maintenance outage to the data centre cooling distribution system and electrical supply
- Additional maintenance actions will be performed:
  - System updates (InfiniBand drivers and tunings)
  - Scheduler updates (patch version update)
  - GPFS storage updates (patch versions and tunings)
  - Network configuration updates (40 GigE link into full production)
- Requires stoppage of all logins, data access, and batch job activity starting at 8:00 am on Monday, August 11, with services restored on Friday, August 15
- Status of system: www.hpc.mcgill.ca and Twitter

Storage System News
GPFS Stability Issues - Update
- Regular occurrences of GPFS instability on nodes due to node expels
  - Typical impact: interruption or halt of writing from jobs
- Recall: latest actions (June 11): 2nd update to all node IB network tunings
  - Significant improvement in stability since that time, with fewer than ~1-2 expels every few days
- Additional GPFS version updates and tunings, to be applied during the August 11-15 downtime, will provide additional performance and stability at the communication layers used by GPFS
Separate Incident: IB Core Switch Module Failure (July 1-4)
- A core switch module failure resulted in communication issues affecting GPFS stability until July 4; the module was replaced on July 9, with affected nodes returned to full service
- An additional IB module will be replaced during the August downtime as preventative maintenance

Storage System News
Reminder: Upcoming Activities
- Online expansion of /gs to full target size (~2.9 PB) to be performed during the August 11-15 maintenance downtime
  - Prerequisite storage system scrub completed
- Tape Archive (Backup) and Hierarchical Storage Management (HSM) integration:
  - All home directories backed up nightly via TSM (started June 26)
  - Access to tape for targeted backups for projects (In Test)
  - Migration of scratch policy to use HSM rules for identification and cleanup (In Progress)
  - Analyzing characteristics of file system contents to identify suitable HSM migration policies (In Progress)

Scheduler Update
- In general, improved overall stability and performance
- A few outstanding issues under review with Adaptive Computing
- Testing of an update to Torque 4.2.8 in the development environment is in progress
  - Full update to Torque 4.2.8 planned for the August 11-15 maintenance window
- Recall: April 10 - qsub for job submission enabled
  - Default PATH settings updated to include Torque commands (qsub, qstat, ...)
  - Much faster response for submissions and queries compared to Moab commands (msub, canceljob, ...)
  - qsub submission filter: qsub -A <RAPid> required for proper accounting and priority assignment (will be relaxed later)
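In practice, a submission under this filter looks something like the following sketch; the RAPid value and script name are placeholders, not real identifiers:

```shell
# Submit a job with Torque's qsub, charging it to a Resource Allocation
# Project (RAPid) as the submission filter requires.
# "abc-123-aa" and myjob.sh are placeholders for your own RAPid and script.
qsub -A abc-123-aa myjob.sh

# Check the status of your own queued and running jobs
qstat -u $USER
```

The Moab equivalents (msub, showq) continue to work, but the Torque commands above respond noticeably faster.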

Scheduler Update
- Job submission documentation updated: www.hpc.mcgill.ca -> Documentation -> Submitting Your Job
- With the migration to CentOS 6, nodes are set to the new scheduler
- In the default queue, the chosen node depends on the pmem (memory) PBS parameter or on a node feature (e.g. m256g, m512g, ...)
- Internal routing for short jobs in the default queue
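As a sketch of how node selection is steered, a job script can request per-process memory with pmem or target a node feature directly; the resource values below are illustrative examples, not site defaults:

```shell
#!/bin/bash
# Illustrative Torque/PBS directives. The walltime, core count, and
# memory values are examples only; abc-123-aa is a placeholder RAPid.
#PBS -l nodes=1:ppn=12        # one node, 12 cores
#PBS -l pmem=3700m            # memory per process, used for node selection
#PBS -l walltime=12:00:00
#PBS -A abc-123-aa

# Alternatively, target a large-memory node by feature instead of pmem:
#   #PBS -l nodes=1:ppn=12:m256g

cd $PBS_O_WORKDIR
./my_program                  # placeholder for your executable
```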

Scheduler Update
Default Queue - Serial Jobs (nodes=1:ppn=n, n<12) (new: sw2)
Default Queue - Parallel Jobs (new: higher walltime boundary)
[Queue limit tables from the slides are not reproduced in this transcript.]

Scheduler Update
Extra large memory nodes (XLM2)
- Alternative to ScaleMP (offline, to be reimaged to CentOS 6)
- Some nodes are reserved by CFI grant holders; others get 12 hours only

Scheduler Update
- Prototype of a new version of the cluster utilization graphs: http://tinyurl.com/guillimin-graph-prototype
- Please direct comments and/or suggestions to: guillimin@calculquebec.ca

Software Update
New Installations:
- NCVIEW/2.1.2-gcc
- espresso/5.1 (Quantum Espresso)
- cmake/2.8.12.2
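Assuming the cluster's usual environment-modules setup, the new packages would typically be picked up along these lines (module names taken from the list above; the exact module tree on Guillimin may differ):

```shell
# Discover and load the newly installed packages via environment modules.
module avail cmake            # list available cmake versions
module load cmake/2.8.12.2
module load espresso/5.1      # Quantum Espresso
cmake --version               # confirm the loaded version is on PATH
```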

Training News
- See Training at www.hpc.mcgill.ca for our full calendar of training and workshops for 2014 and to register
  - All materials from previous workshops are available online
- Upcoming:
  - August 19 - Scientific Visualization Tools (originally scheduled for August 17)
- Recently Completed:
  - July 10 - MapReduce and Hadoop for Big Data
  - June 5 - Advanced OpenMP
  - May 22 - Introduction to the Xeon Phi

User Feedback and Discussion
- Questions? Comments? We value your feedback.
- Guillimin Operational News for Users
- Follow us on Twitter: http://twitter.com/mcgillhpc