CYFRONET SITE REPORT: IMPROVING SLURM USABILITY AND MONITORING
M. Pawlik, J. Budzowski, L. Flis, P. Lasoń, M. Magryś

Presentation plan
Cyfronet introduction
System description
SLURM modifications
Job information scripts
Monitoring

ACC Cyfronet AGH-UST
Established in 1973
Part of AGH University of Science and Technology in Krakow, Poland
Provides free computing resources for scientific institutions
Center of competence in HPC and Grid Computing
Member of the PIONIER National Research and Education Network and operator of the Krakow Metropolitan Area Network for research and education
Participant in large EU projects
Member of international collaborations

PL-Grid infrastructure
Polish national IT infrastructure supporting e-science
Based upon the resources of the most powerful academic resource centers
Compatible and interoperable with the European Grid
Offers grid and cloud computing paradigms
Coordinated by Cyfronet
Benefits for users:
Unified infrastructure from 5 separate compute centres
Unified access to software, compute and storage resources
Non-trivial quality of service
Challenges:
Unified monitoring, computing grants, accounting, security
Creating an environment of cooperation rather than competition
Federation: the key to success

Zeus (older system)
374 TFlops
Several times on the TOP500
Repurposed: Torque/Moab -> SLURM
Cloud services

Prometheus Cluster

Prometheus Cluster
Installed in Q4 2015
CentOS 7 + SLURM 17.02
HP Apollo 8000: 20 racks (4 CDU, 16 compute)
2232 nodes, 53568 CPU cores (Haswell), 279 TB RAM
2160 regular nodes (2 CPUs, 128 GB RAM)
72 nodes with GPGPUs (2x NVIDIA Tesla K40 XL)
4 islands
2.4 PFLOPS total performance (Rpeak): 2140 TFLOPS in CPUs, 256 TFLOPS in GPUs
<850 kW power (including cooling)
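For illustration only, a minimal slurm.conf fragment describing nodes and partitions of this shape might look as follows; the node names, partition names, memory values and time limits are assumptions, not the actual Prometheus configuration:

    # Hypothetical slurm.conf fragment for a Prometheus-like system
    # (names and exact values are illustrative, not the production config)
    GresTypes=gpu
    NodeName=p[0001-2160]  Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=126000 State=UNKNOWN
    NodeName=pgpu[01-72]   Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=126000 Gres=gpu:2 State=UNKNOWN
    PartitionName=plgrid     Nodes=p[0001-2160] Default=YES MaxTime=3-00:00:00 State=UP
    PartitionName=plgrid-gpu Nodes=pgpu[01-72]  MaxTime=3-00:00:00 State=UP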

I/O infrastructure
InfiniBand 4x FDR (56 Gb/s)
Diskless nodes: improves reliability
Lustre FS as main storage:
Scratch: 5 PB @ 120 GB/s
Archive: 5 PB @ 60 GB/s
NFS for:
$HOME dirs
Software

Prometheus Island

Secondary loop

Application & software
Academic workload: lots of small/medium jobs, few big jobs
330 projects, 750 users
Main fields:
Chemistry
Biochemistry (pharmaceuticals)
Astrophysics

SLURM
Really happy with it: openness, community
Power saving (see the slurm.conf sketch below):
Full shutdown/bootup instead of suspend/resume
Don't power down nodes that are already down
Patched some race conditions in slurmctld (deadlock during config read, fix coming in 17.11)
Proper handling of longer account names (>20 chars)
Kmem patch: cgroups accounted for kmem (task/cgroup)
Integration with PL-Grid:
SLA import (sacctmgr)
SLA translates to limits/fairshare/priority
Accounting data reports (sacct)
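A hedged sketch of the power-saving side of this in slurm.conf; the script paths and timing values are assumptions, and skipping already-down nodes was handled by the site's slurmctld patch rather than by stock options:

    # Illustrative slurm.conf power-saving settings (paths and values are assumptions)
    SuspendProgram=/opt/slurm/etc/node_poweroff.sh   # full shutdown instead of suspend
    ResumeProgram=/opt/slurm/etc/node_poweron.sh     # full boot, e.g. via IPMI / HP iLO
    SuspendTime=1800         # power off nodes idle for 30 minutes
    SuspendTimeout=120
    ResumeTimeout=600
    SuspendExcParts=service  # never power down service partitions
    # The deck notes a patch so that DOWN nodes are not powered off; at script level,
    # SuspendProgram could also skip nodes listed by: sinfo -h -t down -o %N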

FairShare configuration 2.0
Deeper FS tree
No static resource allocations
Account names have to be unique; use domain-like names (setup sketch below):
grid.lhc (FS: 10)
grid.lhc.atlas (FS: 5)
grid.lhc.atlas.prd
grid.lhc.atlas.sgm
grid.lhc.alice (FS: 5)
Long account names are a challenge: display in command line tools
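As an illustration of how such a tree could be created (a minimal sketch, not the site's actual tooling; the account names and the FS values 10/5 are taken from the slide, the fairshare of the leaf accounts is assumed):

    # Hypothetical helper that builds a hierarchical fairshare tree via sacctmgr.
    import subprocess

    ACCOUNTS = [
        # (account, parent, fairshare) -- leaf fairshare values are assumptions
        ("grid.lhc",           None,             10),
        ("grid.lhc.atlas",     "grid.lhc",        5),
        ("grid.lhc.atlas.prd", "grid.lhc.atlas",  1),
        ("grid.lhc.atlas.sgm", "grid.lhc.atlas",  1),
        ("grid.lhc.alice",     "grid.lhc",        5),
    ]

    for name, parent, fairshare in ACCOUNTS:
        cmd = ["sacctmgr", "-i", "add", "account", name, f"Fairshare={fairshare}"]
        if parent:
            cmd.append(f"Parent={parent}")
        subprocess.run(cmd, check=True)   # -i: apply without interactive confirmation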

Custom commands
squeue/sstat/scontrol/sacct (and their arguments) wrapped in user-centric scripts that show the important information at a glance (a rough sketch follows below):
Pro-jobs - display information about running jobs
Pro-jobs-history - display information about past jobs
Support for basic filtering, sorting, etc.
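The Pro-jobs scripts themselves are not reproduced in the deck; the following is only a rough sketch of the wrapper idea, with the column selection and formatting chosen for illustration rather than taken from the actual implementation:

    # Minimal "pro-jobs"-style sketch: the current user's running jobs at a glance,
    # built by post-processing squeue output with a machine-friendly format string.
    import getpass
    import subprocess

    def running_jobs(user=None):
        user = user or getpass.getuser()
        # %i=jobid %j=name %P=partition %T=state %M=elapsed %l=time limit %D=nodes %R=reason/nodelist
        fmt = "%i|%j|%P|%T|%M|%l|%D|%R"
        out = subprocess.run(["squeue", "-h", "-u", user, "-o", fmt],
                             capture_output=True, text=True, check=True).stdout
        return [line.split("|") for line in out.splitlines() if line]

    header = ["JOBID", "NAME", "PARTITION", "STATE", "ELAPSED", "LIMIT", "NODES", "NODELIST(REASON)"]
    for row in [header] + running_jobs():
        print("  ".join(f"{col:<14}" for col in row))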

Pro-jobs

Pro-jobs-history

Monitoring integration
A set of services gathers data from SLURM and feeds it to a Graphite/Redis monitoring system (illustrative collector sketch below).
Architecture diagram components: Prometheus (SLURM), on-line processor, Graphite, Redis, WSGI, Grafana
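A minimal sketch of what such a collector could look like, assuming Graphite's standard plaintext protocol on TCP port 2003; the host name and metric naming scheme are assumptions, not the site's actual services:

    # Illustrative collector: count jobs per partition and state from squeue and
    # push the numbers to Graphite over its plaintext protocol.
    import socket
    import subprocess
    import time
    from collections import Counter

    GRAPHITE_HOST = "graphite.example.org"   # assumption
    GRAPHITE_PORT = 2003                     # Graphite plaintext listener

    counts = Counter()
    out = subprocess.run(["squeue", "-h", "-o", "%P %T"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        partition, state = line.split()
        counts[(partition, state.lower())] += 1

    now = int(time.time())
    payload = "".join(f"slurm.jobs.{part}.{state} {n} {now}\n"
                      for (part, state), n in counts.items())
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT)) as sock:
        sock.sendall(payload.encode())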

Monitoring integration and dashboards

Node centric dashboards

Node centric dashboards
Node: State, Reason, Powered down, Responding (collector sketch below)
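A sketch of how per-node state and reason could be collected for such dashboards, assuming a local Redis instance and the redis-py client; key names and field choices are illustrative:

    # Illustrative per-node state collector: one sinfo line per node, cached in Redis.
    import subprocess
    import redis   # redis-py client, assumed available

    r = redis.Redis(host="localhost", port=6379)   # assumption: local Redis

    # %N=node name, %t=short state (idle, alloc, down, ...), %E=reason
    out = subprocess.run(["sinfo", "-N", "-h", "-o", "%N|%t|%E"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        name, state, reason = line.split("|", 2)
        r.hset(f"node:{name}", mapping={"state": state, "reason": reason})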

Node centric dashboards

Node centric dashboards

Job centric dashboards

Summary
All of the scripts are (going to be) open sourced
A toolkit rather than a complete solution
Even more openness: the SLURM community could benefit from sharing software and knowledge
Knowledge sharing is already happening on the mailing list; software sharing not yet?
Questions?