Lustre usages and experiences

Lustre usages and experiences at the German Climate Computing Centre in Hamburg
Carsten Beyer

About DKRZ
- High Performance Computing Centre exclusively for German climate research
- Limited company, non-profit
- Staff: ~70
- Services for climate research:
  - Support for scientific computing and simulation, model optimization, parallelization
  - Data management and archiving
  - Data visualization (3D graphics and video)
- University research group: HPC (Prof. Dr. Ludwig)

Mistral
- First phase 2015 (second phase 2016), total cost: 41 million Euro (Bull supercomputer: 26 million Euro)
- Bullx B700 DLC system
- ~37,000 (+ ~67,000) cores (Intel Haswell / Intel Broadwell)
- 1,550 nodes (2x 12 cores) (+ 1,750 nodes with 2x 18 cores)
- 1.4 (3.0) PetaFLOPS
- 115 TB (266 TB) main memory
- InfiniBand FDR
- Parallel file system: Lustre, 21 (+33) PetaByte, throughput > 0.5 TeraByte/s

Lustre - ClusterStor
Seagate ClusterStor setup:
Phase 1 (CS9000):
- 62 OSS / 124 OSTs / 6 TB disks
- 5 MDTs
- 21 PB / max. 6 billion files
- Lustre 2.5.1 / IB FDR
- 455 million files / 90% filled
Phase 2 (L300):
- 74 OSS / 148 OSTs / 8 TB disks
- 7 MDTs
- 33 PB / max. 8 billion files
- Lustre 2.5.1 / IB FDR
- 196 million files / 40% filled

Filesystem structure
Lustre phase 1:
- HOME directories (/mnt/lustre01/pf => MDT0000)
- POOL directory (/mnt/lustre01/pool => MDT0000)
- Software tree (/mnt/lustre01/sw => MDT0000)
- SCRATCH directories (/mnt/lustre01/scratch => MDT000[1-4])
- WORK directories (/mnt/lustre01/work => MDT000[1-4])
Lustre phase 2 (extension of phase 1):
- WORK directories (/mnt/lustre02/work => MDT000[0-6])
- Soft link /work -> /mnt/lustre01/work
- Soft link /mnt/lustre01/work/<prj> -> /mnt/lustre02/work/<prj>
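
Under DNE phase 1, a directory is pinned to a specific MDT when it is created; a minimal sketch of how a layout like the one above could be set up (project names and MDT indices are made up for illustration):

    # top-level trees stay on MDT0000 (the default MDT)
    mkdir -p /mnt/lustre01/pf /mnt/lustre01/pool /mnt/lustre01/sw
    # DNE phase 1 remote directories: place a directory on a chosen MDT
    # (may require mdt.*.enable_remote_dir=1 on the MDS, depending on the version)
    lfs mkdir -i 1 /mnt/lustre01/scratch/prj_a     # hypothetical project dir on MDT0001
    lfs mkdir -i 2 /mnt/lustre01/work/prj_b        # hypothetical project dir on MDT0002
    # soft links keep the user-visible paths stable across the two filesystems
    ln -s /mnt/lustre01/work /work
    ln -s /mnt/lustre02/work/prj_c /mnt/lustre01/work/prj_c   # project relocated to the phase-2 fs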

Migration GPFS to Lustre
- Copying 4.5 PB from GPFS (AIX) to Lustre (Linux)
- GPFS policies used to generate a file list (130 million files)
- Sort and split the file list (mix of big/medium/small files) as input for rsync
- SLURM on the new system used to schedule ~6,000 jobs copying via an IB gateway from the NSD servers
Challenges:
- Not overloading the NSD servers of the previous system (2x 10 GbE per node)
- Changing UID/GID for some users from IDs <1000 to >20000 during the transfer (requires a newer rsync version than shipped with RHEL 6)
- Tight time frame, because the benchmarks for approval of the system were still ongoing
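
A rough sketch of the split-filelist / SLURM-driven rsync scheme; the chunk size, source host, paths and the example UID/GID pairs are placeholders, and the --usermap/--groupmap options need rsync >= 3.1, i.e. newer than the rsync in RHEL 6:

    # split the GPFS-generated file list into equal chunks, one per copy job
    split -l 25000 filelist.txt chunk.

    # one SLURM job per chunk; a real run would list all affected UID/GID pairs
    for chunk in chunk.*; do
        sbatch --wrap "rsync -a --files-from=$chunk \
            --usermap=501:20501,502:20502 --groupmap=501:20501 \
            nsd-server:/gpfs/fs1/ /mnt/lustre01/work/"   # host and source path are assumptions
    done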

Migration GPFS to Lustre: Where's my quota, and why does mv take so long?
- All copied project data initially landed on MDT0000 in a separate location, while the new project directories were distributed across MDT000[1-4]
- How to move the old data into the new directories? It could not simply be moved with mv because of DNE phase 1 (a cross-MDT mv degrades to copy + delete)
- Tried tools like Shift for copying
- Quota could no longer be used as before (user / fileset / user-in-fileset), since HOME/SCRATCH/WORK are now in one filesystem
- Using Robinhood for soft quotas (starting 2016); users can see their quota and amount of data on a DKRZ web frontend

Migration GPFS to Lustre: Small files
- HOME directories (~30 million files / 6 TB) and the software tree
- Problem running backups with Calypso/Simpana: a full backup of HOME takes days
- Large load times for software (e.g. Matlab, Python)
- Solution for the software tree: generate a 300 GB image file striped across 16 OSTs
- Loop-mount this image file on the interactive nodes (login/graphics)
- Client-side caching then gives higher performance
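
One way to build such an image, sketched with assumed paths and ext4 as the filesystem inside the image:

    # create an empty file with a 16-way stripe, size it, and put a filesystem in it
    lfs setstripe -c 16 /mnt/lustre01/sw/sw-image.img
    truncate -s 300G /mnt/lustre01/sw/sw-image.img
    mkfs.ext4 -F /mnt/lustre01/sw/sw-image.img        # ext4 is an assumption
    # populate once via a read-write loop mount on an admin node, then on the
    # login/graphics nodes mount it read-only; reads are served from the client cache
    mount -o loop,ro /mnt/lustre01/sw/sw-image.img /sw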

Tools used - Copytools
- Self-Healing Independent File Transfer (Shift), Paul Kolano (NASA)
  - Depends on mutil (Paul Kolano / NASA)
  - Final rsync needed at the end (hardlinks)
- lfs find + rsync
  - Generate a file list with lfs find and split it into equal parts
  - rsync with the split lists (running as SLURM jobs in parallel)
  - Final rsync needed afterwards for hardlinks and directory ownership
- pftool, Los Alamos National Lab
  - Easy to use, scalable across several nodes with SLURM
  - Needs a final rsync afterwards for hardlinks and directory ownership
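
A sketch of the "lfs find + rsync" variant, including the final pass that re-creates hardlinks and fixes directory ownership (paths and chunk size are assumptions):

    # build the file list on the source filesystem and split it into equal parts
    cd /mnt/lustre01/work/old_prj
    lfs find . -type f > /tmp/filelist.txt
    split -l 50000 /tmp/filelist.txt /tmp/part.

    # one rsync per chunk, run in parallel as SLURM jobs
    for part in /tmp/part.*; do
        sbatch --wrap "rsync -a --files-from=$part /mnt/lustre01/work/old_prj/ /mnt/lustre02/work/new_prj/"
    done

    # final single pass: -H re-creates hardlinks, directory ownership/times get fixed
    rsync -aH /mnt/lustre01/work/old_prj/ /mnt/lustre02/work/new_prj/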

Tools used - Robinhood
Report generation:
- User reports -> HOME/SCRATCH
- Project reports -> WORK
  - Total amount of data for each project (~220 active projects)
  - Per user in a project (up to 130 users per project)
  - Overview for the project admin
- Soft quota notifications by mail to users / project admins
Setup:
- Robinhood version 2.5.6-1, reading Lustre changelogs
- 2 Robinhood servers for lustre01 (connected to 2/3 MDTs)
- 3 Robinhood servers for lustre02 (connected to 2/2/3 MDTs)
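
Reading changelogs requires a registered changelog consumer on each monitored MDT; a minimal sketch of the registration and of typical report queries (device name, config path and user are placeholders, and exact options can differ between Robinhood releases):

    # on the MDS: register a changelog consumer on each MDT that Robinhood watches
    # (prints a reader id such as cl1, which goes into the Robinhood config)
    lctl --device lustre01-MDT0001 changelog_register

    # run the policy engine in changelog mode, then query reports
    robinhood --readlog -f /etc/robinhood.d/lustre01.conf
    rbh-report -f /etc/robinhood.d/lustre01.conf --fs-info     # overall usage
    rbh-report -f /etc/robinhood.d/lustre01.conf -u someuser   # per-user usage (placeholder user)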

Tools used - HPSS pftp
Tape archive HPSS:
- Projects have to request storage on an annual basis (quota)
- Data is copied manually by the users for their projects with pftp
- Users can choose single / double copy or long-term archive (LTA)
  - A double copy is also accounted twice against the project
  - LTA needs a description of the data (to be identifiable later; stored up to 10 years)
- Currently no automatic migration from Lustre to HPSS
  - A possible tool could be Robinhood, but this has not been tried yet
  - Open question: how to identify the data to be migrated for each project
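
An archive transfer with HPSS pftp is essentially an interactive FTP session; a sketch with assumed host and path names (HPSS pftp also offers parallel transfer commands such as pput/pget, depending on the installation):

    pftp <hpss-host>                          # placeholder host
    # after authentication:
    cd /hpss/prj/ab0123                       # hypothetical project directory in the archive
    lcd /mnt/lustre02/work/ab0123/output
    put results_2016.tar                      # mput *.tar for multiple files
    quit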

Monitoring tools / sources
Sources:
- Started with our own Python daemon on the ClusterStor (replaced by Seastream)
- Seastream API on ClusterStor (currently switched off on lustre02)
- Entries from syslog (Lustre clients)
- Lustre llite stats (from clients)
Tools:
- OpenTSDB / HBase / Hadoop / Elasticsearch
Frontends:
- XDMoD / Grafana / Kibana
- Icinga on ClusterStor
Thanks to: Olaf Gellert, Hendrik Bockelmann, Josef Dvoracek (ATOS), Eugen Betke (Uni HH)
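
Client-side llite statistics can be read with lctl and pushed into OpenTSDB via its HTTP /api/put endpoint; a minimal sketch, with metric name, tags and OpenTSDB host as assumptions rather than DKRZ's actual schema:

    host=$(hostname -s)
    ts=$(date +%s)
    # sum read_bytes over all mounted Lustre filesystems (field 7 is the byte sum)
    sum=$(lctl get_param -n llite.*.stats | awk '/^read_bytes/ {s+=$7} END {print s+0}')
    curl -s -X POST "http://opentsdb:4242/api/put" \
         -H 'Content-Type: application/json' \
         -d "[{\"metric\":\"lustre.llite.read_bytes\",\"timestamp\":$ts,\"value\":$sum,\"tags\":{\"host\":\"$host\"}}]"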

Monitoring - Overview
[Architecture diagram: the Mistral cluster (3340 compute nodes, 24 login nodes) delivers system metrics and Lustre llite stats (on demand / permanently); the Lustre servers for lustre01 and lustre02 deliver Seastream job and system stats; further inputs are a syslog-ng cluster, SLURMctld job scripts and a nightly test suite with IOR and climate models. Everything feeds, via Nginx proxies, into OpenTSDB (6 instances), HBase (6 instances), Hadoop (4 instances) and Elasticsearch (4 instances, fed from log files), with XDMoD, Grafana and Kibana as frontends.]

Monitoring - Grafana (dashboard screenshots)

Pros / Cons from daily business
Most of the time things are fine, but:
- Performance degradation when reaching >95% fill level and asymmetric usage of the OSTs (lustre01)
  - First solution was deactivating OSTs, migrating data off them and shifting data to the lustre02 filesystem; that left about 1.2 PB of unusable disk space
  - Newer solution is setting qos_threshold_rr to 5% on all MDTs; with this setting the unusable space is below 1 PB
- Performance degradation from the monthly RAID check (lustre01)
  - Lowering the RAID-check priority helped; the RAID check now runs for about 2 ½ weeks
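
Both measures map to standard Lustre tuning and maintenance commands; a sketch with an assumed fsname and OST (qos_threshold_rr controls when the MDS allocator switches from round-robin to free-space-weighted object allocation):

    # on the MDSes: switch to free-space-weighted allocation once OST usage
    # differs by more than 5% (parameter may live under lod.* instead of lov.*,
    # depending on the Lustre version; fsname lustre01 is an assumption)
    lctl set_param lov.lustre01-MDT*-mdtlov.qos_threshold_rr=5

    # drain a nearly full OST: stop new object creation there, then migrate data off
    lctl set_param osp.lustre01-OST0007-osc-MDT*.max_create_count=0   # hypothetical OST
    lfs find /mnt/lustre01 --ost lustre01-OST0007_UUID -type f | lfs_migrate -y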

Pros / Cons from daily business
- Client dis-/reconnects from OSTs / MDTs
  - Issue for SLURM jobs doing I/O (extended runtime)
  - Robinhood could not read the changelog anymore; workaround: failover/reboot/failback of the affected MDT
- Firmware bug in the hardware (lustre02)
  - A rare watchdog issue causes an OSS HA pair to power down
  - After powering the OSSes back up and repairing the OSTs, clients could neither reconnect nor umount lustre02
  - A reboot of all Lustre clients was needed (power reset)

Thank you for your attention!
http://www.dkrz.de
Carsten Beyer, beyer@dkrz.de
Questions?