Challenges in making Lustre systems reliable

Roland Laifer
Steinbuch Centre for Computing (SCC)
KIT - University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association
LAD'14, 2014-09-22
www.kit.edu

Background and motivation
- Most of our compute cluster outages were caused by Lustre
  - This would probably be similar with other parallel file systems
- Even recently we have seen bad Lustre versions
  - Most bugs were related to new client OS versions or to quotas
- Frequent discussions with users about I/O errors
  - Often small changes allow Lustre evictions to be avoided
- We had silent data corruption caused by storage hardware
  - Huge damage: a restore of the complete file systems was needed

Overview
- Lustre systems at KIT
- Challenge #1: Find stable Lustre versions
- Challenge #2: Find stable storage hardware
- Challenge #3: Identify misbehaving applications
- Challenge #4: Recover from disaster

Lustre systems at KIT - overview
- Multiple clusters and file systems are connected to the same InfiniBand fabric
  - A good solution to connect Lustre to midrange HPC systems
  - Select an appropriate InfiniBand routing mechanism and cable connections
  - Allows direct access to data of other systems without LNET routers
- [Diagram: KIT cluster, departments cluster, BW universities cluster, and BW tier 2 cluster share one InfiniBand fabric, together with their scratch file systems, project data for the tier 2 cluster, home1, home2, and the HPC storage of a special department.]

Lustre systems at KIT - details

  System name            hc3work            pfs2                       pfs3
  Users                  KIT, 2 clusters    universities, 4 clusters   universities, tier 2 cluster
  Lustre server version  DDN Lustre 2.4.3   DDN Lustre 2.4.3           DDN Lustre 2.4.3
  # of clients           868                1941                       540
  # of servers           6                  21                         17
  # of file systems      1                  4                          3
  # of OSTs              28                 2*20, 2*40                 1*20, 2*40
  Capacity (TB)          203                2*427, 2*853               1*427, 2*853
  Throughput (GB/s)      4.5                2*8, 2*16                  1*8, 2*16
  Storage hardware       DDN S2A9900        DDN SFA12K                 DDN SFA12K
  # of enclosures        5                  20                         20
  # of disks             290                1200                       1000

Challenge #1: Find stable Lustre versions (1)
- Lustre 1.8.4 / 1.8.7 was running very stably
  - Needed to upgrade to version 2.x for SLES11 SP2 client support
- Lustre 2.1.[2-4] on the servers was pretty stable
  - Clients with version 2.3.0, needed for SLES11 SP2, caused trouble
  - SLES11 SP2 clients with version 2.4.1 were stable
  - Needed to upgrade the servers to 2.4.1 for SP3/RH6.5 client support
- Lustre 2.4.1 on the servers caused some problems
  - Bad failover configuration caused outages (LU-3829/4243/4722)
  - Needed to disable ACLs after a security alert (LU-4703/4704)
  - User and group quota values were wrong (LU-4345/4504)
- Lustre 2.5.1 clients on RH6.5 were bad
  - LBUGs and clients hanging in status "evicted" (LU-5071)

Challenge #1: Find stable Lustre versions (2)
- Lustre 2.4.3 on the servers is stable
  - SLES11 SP3 clients with Lustre 2.4.3 cause no problems
  - RH6.5 clients with Lustre 2.5.2 are now stable, too
- User and group quotas are still forged (LU-4345/4504/5188); a quick cross-check is sketched after this list
  - Not yet fixed in the current maintenance release(?)
  - Objects on OSTs are sometimes created with an arbitrary UID/GID
  - The recommended way to fix bad quotas is hardly usable
- Choose a stable maintenance release
- Stay with old stable versions
  - Bad dependency: OS upgrade -> new Lustre client version required -> new Lustre server version required
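One way to spot such forged quota values (not taken from the talk) is to compare what Lustre quota accounting reports with the usage actually stored on disk. A minimal sketch, assuming pfs2 is mounted at /lustre/pfs2 and using hypothetical user, group, and directory names; "lfs quota" itself is standard Lustre tooling:

    # Reported block/inode usage and limits for a user and a group
    lfs quota -u someuser /lustre/pfs2
    lfs quota -g somegroup /lustre/pfs2
    # Compare the reported block usage with what is actually stored
    du -sk /lustre/pfs2/home/someuser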

Challenge #2: Find stable storage hardware
- We had silent data corruption on Infortrend and HP RAID systems
  - More details were given in my talk at ELWS'11
  - Infortrend systems returned different data when reading twice
    - This happened about once per year on 1 of 60 RAID systems
  - HP MSA2000 G2 had a problem with cache mirroring
    - After "PCIE link failed" messages, e2fsck sometimes showed corruption
- Some hardware features greatly improve stability
  - Features which reduce rebuild times
    - Declustered RAID or partial rebuilds using bitmaps
    - We have seen triple disk failures on RAID6 arrays multiple times
  - Features which detect silent data corruption
  - High-end storage systems are most likely more stable
- During procurements, give stability features a high weighting

Challenge #3: Identify misbehaving apps
- Find out which file is causing an eviction
  - On version 2.x, LustreError messages like this might appear:
    LustreError: 45766:0:(osc_lock.c:817:osc_ldlm_completion_ast()) sub@...: [0 W(2):[0, ]@[0x200001ea7:0x18682:0x0]]
  - Use the FID and the file system name to identify the badly accessed file (see the sketch after this list):
    lfs fid2path pfs3wor4 [0x200001ea7:0x18682:0x0]
  - Unfortunately, sometimes only the root of the file system is reported
- Use performance monitoring to check what users are doing
  - For good instructions see Daniel Kobras' talk at LAD'12
  - E.g. this reveals scratch-like usage of the home directory file systems
  - Frequently users do not know how their I/O ends up on Lustre
- Fixing misbehaving applications and reducing the load stabilizes the file system and makes it more responsive
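The FID-to-path lookup can be scripted so that evictions logged on a client are mapped to file names in one pass. A minimal sketch, assuming the LustreError lines land in /var/log/messages and a file system name of "pfs3work" (an assumption; the slide shows the truncated "pfs3wor4"):

    # Extract FIDs from eviction-related messages and resolve them to paths.
    grep 'osc_ldlm_completion_ast' /var/log/messages \
      | grep -o '\[0x[0-9a-f]*:0x[0-9a-f]*:0x[0-9a-f]*\]' \
      | sort -u \
      | while read -r fid; do
          lfs fid2path pfs3work "$fid"   # prints the path(s) below the fs root
        done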

Challenge #4: Recover from disaster (1)
- A disaster can be caused by
  - a hardware failure, e.g. a triple disk failure on a RAID6 array
  - silent data corruption caused by hardware, firmware or software
  - complete loss of the infrastructure, e.g. caused by fire or flood
- A timely restore of 100s of TB does not work
  - The transfer takes too long and rates are lower than expected
    - Bottlenecks are often in the network or at the backup system
    - Metadata recreation rates can be the limiting factor
  - We restored a 70 TB Lustre file system with 60 million files
    - With old hardware and IBM TSM this took 3 weeks (see the arithmetic sketch below)
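For scale, the average rates implied by that restore can be worked out directly; a back-of-the-envelope sketch using shell arithmetic (decimal TB assumed):

    secs=$((3 * 7 * 24 * 3600))                    # 3 weeks = 1,814,400 seconds
    echo "$((70 * 10**12 / secs / 10**6)) MB/s"    # ~38 MB/s average data rate
    echo "$((60 * 10**6 / secs)) files/s"          # ~33 files/s average metadata rate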

Challenge #4: Recover from disaster (2)
- Solution on other huge storage systems: do not restore, but switch the client or application over to the backup copy (the steps are illustrated after this list)
  [Diagram: the client/application can be pointed at either the primary data or the backup data.]
- Backup:
  1. Take a complete snapshot
  2. Transfer the snapshot to the backup system
  3. Take an incremental snapshot
  4. Transfer the incremental snapshot
  5. Install the incremental snapshot on the backup system
- Disaster recovery:
  6. Remove the bad incremental snapshot if required
  7. Redirect the application or client to the backup system
- Note: Data created after the last good incremental snapshot is lost.
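The numbered steps map onto any storage stack with snapshots and incremental replication. Purely as an illustration, the same pattern with ZFS-style snapshots (ZFS is not what the slide describes; pool, dataset, and host names are hypothetical):

    # Steps 1-2: full snapshot and transfer to the backup system
    zfs snapshot tank/data@full
    zfs send tank/data@full | ssh backuphost zfs recv backuppool/data
    # Steps 3-5: incremental snapshot, transfer, and install on the backup system
    zfs snapshot tank/data@incr1
    zfs send -i tank/data@full tank/data@incr1 | ssh backuphost zfs recv backuppool/data
    # Steps 6-7: in a disaster, discard a bad increment and repoint the clients
    ssh backuphost zfs rollback -r backuppool/data@full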

Challenge #4: Recover from disaster (3)
- Transfer the previous solution to Lustre: rsnapshot uses rsync to create a copy, and creates multiple copies by using hard links for unchanged data
  [Diagram: a Lustre client runs rsnapshot to copy the primary file system to the backup file system.]
- Backup (a configuration sketch follows below):
  1. Use rsnapshot (rsync) to transfer all data to the backup file system
  2. Use rsnapshot (rsync + hard links) to transfer new data
  3. rsnapshot possibly removes old copies
- Disaster recovery:
  4. Use a good rsnapshot copy and move directories to the desired location
  5. Adapt the mount configuration and reboot the Lustre clients
- Note: Data created after the last good rsync is lost.
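How such an rsnapshot setup might look is sketched below. This is not the KIT configuration, just a minimal example with assumed mount points (/lustre/home1 as the primary file system, /lustre/backup as the backup file system); note that rsnapshot.conf requires TAB characters between fields:

    # /etc/rsnapshot.conf (fragment) - fields must be separated by TABs
    config_version	1.2
    snapshot_root	/lustre/backup/rsnapshot/
    retain	twiceweekly	8
    rsync_long_args	--delete --numeric-ids --relative --delete-excluded
    backup	/lustre/home1/	home1/

    # Run from cron on one Lustre client that mounts both file systems, e.g.:
    #   0 20 * * 3,6   root   /usr/bin/rsnapshot twiceweekly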

Challenge #4: Recover from disaster (4)
- Details of our solution
  - 3 file systems with permanent data (home, project, software)
  - Each holds production data of some groups and backup data of others
- Experiences
  - We did not yet need the disaster recovery on production systems
  - User-requested restores have been done and are just a copy
  - Backup is done twice per week on one client with 4 parallel processes (see the sketch below)
    - For 100 million files and 5 TB of snapshot data this takes 26 hours
  - In addition, we still use IBM TSM backup
    - This allows users to restore files by themselves
    - Alternatively, rsnapshot copies could be exported read-only via NFS
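One simple way to get several parallel backup streams (not necessarily what is used at KIT) is to fan rsync out over the top-level directories; a sketch with hypothetical paths, assuming directory names contain no whitespace:

    # Run 4 rsync streams in parallel, one per top-level directory tree.
    cd /lustre/home1 && ls -d */ | xargs -P 4 -I{} \
        rsync -aH --delete /lustre/home1/{} /lustre/backup/current/{}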

Challenge #4: Recover from disaster (5)
- Restrictions of the solution
  - Slowly progressing silent data corruption might pollute all backup copies
    - The same problem exists for other backup solutions
    - We have not yet seen this case, i.e. the OSSs go to read-only status pretty fast
  - Recovery does not work if both file systems are affected by a critical Lustre bug
    - Different Lustre versions on the primary and backup file systems might help
- Using lustre_rsync instead of rsync would avoid the file system scans
  - We plan to investigate if and how this would work (a usage sketch follows below)
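For reference, changelog-based replication with lustre_rsync roughly looks as follows. The MDT name, mount points, and changelog user id are assumptions; the tool and its options are described in the Lustre Operations Manual:

    # On the MDS: register a changelog consumer for the MDT (prints an id, e.g. cl1)
    lctl --device pfs3-MDT0000 changelog_register
    # On a client that mounts both file systems: replicate all changes recorded
    # since the last run, without scanning the namespace
    lustre_rsync --source=/lustre/pfs3 --target=/lustre/backup/pfs3 \
        --mdt=pfs3-MDT0000 --user=cl1 \
        --statuslog=/var/log/lustre_rsync.log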

Further information
- rsnapshot: http://www.rsnapshot.org
- All my talks about Lustre: http://www.scc.kit.edu/produkte/lustre.php
- Contact: roland.laifer@kit.edu