HPC at UZH: status and plans

HPC at UZH: status and plans Dec. 4, 2013

This presentation's purpose: meet the sysadmin team; update on what's coming soon in Schroedinger's HW; review old and new usage policies; discussion (later on).

We want your feedback! How do the policies impact your usage and productivity? What would you like to see introduced, and what removed? Would you like changes introduced to Schroedinger to converge towards the new HPC system?

Meet the team

GC3 sysadmin team: Tyanko Aleksiev, Antonio Messina, Riccardo Murri

Contact us: hpcnadmin@id.lists.uzh.ch or http://sdesk.uzh.ch

Today's HW status

Status of current HW: Schroedinger's HW has been in 24/7 use for 4 years, and pieces are now starting to fail: 39% of the total HW failures happened in 2013; every 2 weeks, 3 Lustre disks fail; 1 storage blade on the Panasas system failed; there were 16 compute-node failures in the last two months. Replacing storage is the most important task, as it holds the data.

Review of current storage policies: /home: 5 GB quota, nightly backups on Tivoli; /data: 50 GB quota, no backups; /lustre: no quota, no backups, originally meant as scratch space.
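
A minimal sketch of how a user might check their footprint against these quotas; the directory layout (/home/$USER, /data/$USER, /lustre/$USER) and the availability of per-user quota reporting are assumptions for illustration:

    # Check your space usage against the per-filesystem policies (assumed paths).
    quota -s                                        # per-user quota report, where user quotas are enabled
    du -sh /home/$USER /data/$USER /lustre/$USER    # raw space used under each filesystem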

Status of support contracts: the support contract with Oracle expires in mid-December. HW support on the compute nodes will not be renewed. Expect a little capacity degradation during next year (estimated at around 5%).

New storage HW

New storage HW: Panasas. A new Panasas PAS8 has already been bought by the Informatikdienste and arrives this week (likely): two storage shelves, 76 TB raw in total, in a fully redundant configuration.

New Panasas: deployment (December 2013). We will try to switch to the new PAS8 during this month's scheduled maintenance window (Dec. 18). 1. Migrate all data (estimated time: 7-10 days). The old Panasas still serves the /home and /data directories for IDES; we might need to throttle the migration bandwidth in order not to interfere with normal cluster operations. 2. Do a final rsync during the downtime; the filesystem must be quiescent, so the cluster must be free of users and jobs, and this might take longer than expected!
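
A minimal sketch of what such a final catch-up copy could look like; the mount points and rsync options here are assumptions for illustration, not the exact commands the admin team will run:

    # Final catch-up pass from the old Panasas to the new PAS8 (assumed mount points).
    # -a preserves permissions and times, -H keeps hard links, --delete removes files gone from the source.
    rsync -aH --delete --numeric-ids /old_panasas/home/ /new_panasas/home/
    rsync -aH --delete --numeric-ids /old_panasas/data/ /new_panasas/data/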

New storage HW: Lustre. The Informatikdienste have already bought a complete replacement for the Lustre filesystem. The HW will be delivered in December, but it won't be ready for production use until after Xmas.

New Lustre: deployment (February 2014). 1. Burn in and tune the servers and the new Lustre software until February 18, 2014. 2. In the meantime, you have to copy your important data from /lustre into /home or /data (see the sketch below). 3. Switch over to the new Lustre filesystem during the scheduled maintenance in February. No Lustre files will be preserved across the switch!
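
A minimal sketch of how a user could rescue important results from /lustre before the switch-over; the directory names are assumptions for illustration:

    # Copy results you want to keep from /lustre into persistent space,
    # then verify the copy before relying on it (assumed directory names).
    rsync -av /lustre/$USER/important_results/ /data/$USER/important_results/
    diff -r /lustre/$USER/important_results /data/$USER/important_results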

New Lustre: deployment plan rationale. 1. The current Lustre's capacity is 230 TB and it is over 70% full; migrating all data would take ages (a rough estimate is sketched below). 2. Lustre was conceived as scratch space but is in fact used as data storage; given the filesystem size, we cannot guarantee data safety (backups, redundancy), so we would like to enforce the policy that /lustre is for scratch (i.e., temporary) files only! 3. The Lustre version bumps from 1.8.5 to 2.4, with possible co-existence problems.
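
A back-of-the-envelope estimate of the migration time; the sustained copy rates are assumptions, not measurements:

    # Data to move: roughly 0.70 x 230 TB ~ 160 TB
    #   at 1 GB/s sustained:   160e12 B / 1e9 B/s = 160,000 s ~ 1.9 days
    #   at 200 MB/s sustained: 160e12 B / 2e8 B/s = 800,000 s ~ 9.3 days
    # Small-file overhead and normal production load would stretch these figures further.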

Review of future storage policies: /home: 10 GB quota, nightly backups on Tivoli; /data: 100 GB quota, no backups; /lustre: no quota, no backups, scratch only, and files older than 60 days will be automatically deleted.
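
A minimal sketch of how a user might list the /lustre files a 60-day cleanup would likely target; the exact criterion used by the automatic deletion, and the /lustre/$USER layout, are assumptions here:

    # List files under your /lustre space not modified in the last 60 days (GNU find).
    find /lustre/$USER -type f -mtime +60 -printf '%TY-%Tm-%Td %10s %p\n' | sort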

SW and policy changes

SW changes: Operating System. No planned changes to the operating system, but we may need to update to SLES 11 SP2 because of the new Panasas. (So maybe you'll end up recompiling your applications anyway.)

modulefiles reboot (March 2014). Remove the current module files and start with a new set: C/C++ compilers (GNU + Intel); FORTRAN compilers (GNU + Intel); OpenMPI; FFTW; MatLab (what toolboxes do you use/need?); R. Need more supported software? Ask!
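
A minimal sketch of everyday environment-modules usage after the reboot; the module names shown are illustrative assumptions, not the final set:

    module avail              # list the modules provided in the new set
    module load gcc openmpi   # pick a compiler and the supported MPI stack
    module list               # show what is currently loaded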

Standardize on OpenMPI (April 2014). OpenMPI 1.6 becomes the only MPI library supported by the Schroedinger admin team, provided via "module load openmpi". ParastationMPI remains available, but the support contract with Par-Tec will be rescinded. Other MPI libraries will be removed (they have not been updated nor used in quite a while).
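
A minimal sketch of building and running a code against the supported OpenMPI stack; the source file name and process count are assumptions for illustration:

    module load openmpi
    mpicc -O2 -o hello_mpi hello_mpi.c   # use mpif90 for Fortran sources
    mpirun -np 16 ./hello_mpi            # run inside a batch job, not on the login node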

SW changes: batch system? (May 2014). Oracle is no longer developing GridEngine, nor supporting it. Switch to SLURM? It is already used at CSCS and in the zbox4; it brings big changes in usage and on the command line; we would organize a 1-day training session on the new batch system; a test-drive SLURM cluster will be available starting March 2014; what should the user-level documentation cover? Alternative: keep GridEngine and keep submitting as usual, but known bugs will not be fixed.
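
A hedged sketch of roughly equivalent job scripts under the two batch systems; the parallel-environment name and resource values are assumptions for illustration, not site defaults:

    #!/bin/bash
    # GridEngine job script (today); submit with: qsub job.sh
    #$ -N myjob
    #$ -pe openmpi 16
    #$ -l h_rt=04:00:00
    mpirun -np 16 ./myprog

    #!/bin/bash
    # Roughly equivalent SLURM job script (possible future); submit with: sbatch job.sh
    #SBATCH --job-name=myjob
    #SBATCH --ntasks=16
    #SBATCH --time=04:00:00
    srun ./myprog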

Timeline: December 2013, new /home and /data; February 2014, new /lustre; March 2014, module files reboot; April 2014, standardize on OpenMPI; May 2014, new batch system?

Thank you! Any questions?

Appendix

modulefiles usage, Jun.-Nov. 2013 (module, load count):
mpi/openmpi/gcc                52682
java/1.7.0                     27621
intel/mkl                       4556
intel/comp                      4507
intel/comp/11.1.064             3838
mpi/openmpi-1.4.5/gcc-4.5.0     3086
mpi/openmpi-1.4.5/gcc-4.6.1     2409
R/2.15.1                        1596
matlab/r2011a                   1213
intel/comp/12.1.0               1054
intel/mkl/10.1.2.024            1001
mpi/parastationmpi/intel         889
fftw/3.2.2-double                422
binutils/2.20.1                  416
mpi/openmpi-1.6.2/gcc-4.6.1      329
gcc/4.6.1                        228
mpi/openmpi/intel                181
mpi/parastationmpi/gcc           105
gcc/4.5.0-system                  87
matlab/r2012a                     40
gcc/4.5.3                         12
gcc/4.5.0                         11
R/2.11                             9
mpi/openmpi/gcc-4.5.0              6
mpi/openmpi-1.4.3/gcc-4.5.0        6
mpi/mvapich2                       6
gcc/4.4.3                          6
mpi                                2
matlab                             2
gcc/4.4.1                          2
gcc                                2
intel                              1
gcc/4.3.4                          1
binutils                           1
(Back to the "modulefiles reboot" slide.)