Lessons learned from Lustre file system operation


Lessons learned from Lustre file system operation
Roland Laifer
Steinbuch Centre for Computing (SCC)
KIT - University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association
www.kit.edu

Overview
- Lustre systems at KIT
  - KIT is the merger of the University of Karlsruhe and the Research Center Karlsruhe, with about 9000 employees and about 20000 students
- Experiences
  - with Lustre
  - with the underlying storage
- Options for sharing data
  - by coupling InfiniBand fabrics
  - by using Grid protocols

Lustre file systems at XC1
- [Diagram: XC1 cluster with 120 compute nodes, Quadrics QSNet II interconnect, an admin node, 1 MDS and 6 OSS servers attached to 7 HP EVA storage arrays]
- File systems: /lustre/data and /lustre/work
- HP SFS appliance with HP EVA5000 storage
- Production from Jan 2005 to March 2010

Lustre file systems at XC2
- [Diagram: XC2 cluster with 762 compute nodes, InfiniBand 4X DDR interconnect, 2 MDS and 8 OSS servers]
- File systems: /lustre/data and /lustre/work
- HP SFS appliance (initially) with HP storage
- Production since Jan 2007

Lustre file systems at HC3 and IC1
- [Diagram: IC1 cluster (213 nodes, InfiniBand DDR) and HC3 cluster (366 nodes, InfiniBand QDR) attached via InfiniBand DDR to two Lustre systems: pfs with 2 MDS and 20 OSS servers, hc3work with 2 MDS and 4 OSS servers and their OSTs]
- File systems: /pfs/data, /pfs/work and /hc3/work
- Production on IC1 since June 2008 and on HC3 since Feb 2010
- pfs is a transtec/Q-Leap solution with transtec provigo (Infortrend) storage
- hc3work is a DDN (HP OEM) solution with DDN S2A9900 storage

bwgrid storage system (bwfs) concept
- Lustre file systems at 7 sites in the state of Baden-Württemberg
- Grid middleware for user access and data exchange
- Production since Feb 2009
- HP SFS G3 with MSA2000 storage
- [Diagram: Grid services, backup and archive, and gateway systems connected via the BelWü network to University 1 through University 7]

Summary of current Lustre systems

  System name         hc3work          xc2                pfs                      bwfs
  Users               KIT              universities,      departments,             universities,
                                       industry           multiple clusters        grid communities
  Lustre version      DDN Lustre       HP SFS G3.2-3      Transtec/Q-Leap          HP SFS G3.2-[1-3]
                      1.6.7.2                             Lustre 1.6.7.2
  # of clients        366              762                583                      >1400
  # of servers        6                10                 22                       36
  # of file systems   1                2                  2                        9
  # of OSTs           28               8 + 24             12 + 48                  7*8 + 16 + 48
  Capacity (TB)       203              16 + 48            76 + 301                 4*32 + 3*64 + 128 + 256
  Throughput (GB/s)   4.5              0.7 + 2.1          1.8 + 6.0                8*1.5 + 3.5
  Storage hardware    DDN S2A9900      HP                 transtec provigo         HP MSA2000
  # of enclosures     5                36                 62                       138
  # of disks          290              432                992                      1656

General Lustre experiences (1)
- Using Lustre for home directories works
  - Problems with users creating 10000s of files per small job
    - Convinced them to use local disks (we have at least one per node)
  - Problems with inexperienced users using home for scratch data
    - Also puts a high load on the backup system
  - Enabling quotas helps to quickly identify bad users
    - Enforcing quotas for inodes and capacity is planned (see the sketch below)
  - A restore of the home directories would take weeks
    - The idea is to restore important user groups first
    - Luckily, a complete restore has never been required so far
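
A minimal sketch of how per-user usage reporting and quota limits look with the standard lfs tool; the user name, limits and file system path are examples, not the actual KIT settings:

    # Report current capacity and inode usage for one user (quickly identifies "bad" users)
    lfs quota -u someuser /lustre/home

    # Set soft/hard limits for capacity (in kbytes) and for inodes
    lfs setquota -u someuser -b 200000000 -B 220000000 -i 1000000 -I 1100000 /lustre/home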

General Lustre experiences (2)
- Monitoring performance is important
  - Check the performance of each OST during maintenance
    - We use dd and parallel_dd (our own perl script); see the sketch below
  - Find out which users are heavily stressing the system
    - We use collectl and a script attached to bugzilla 22469
    - Then we discuss more efficient system usage with them, e.g. striping parameters
- Nowadays Lustre runs very stably
  - After months the MDS might stall
    - Usually the server is shot by Heartbeat and failover works
  - Most problems are related to the storage subsystems
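
A minimal sketch of the idea behind the per-OST checks, assuming a scratch directory on the file system; paths, sizes and the OST index are illustrative, and parallel_dd simply runs such dd streams against all OSTs in parallel:

    # Pin a test file to a single OST (index 3) and measure write/read with dd
    lfs setstripe -c 1 -i 3 /lustre/work/admin/ost3.tmp
    dd if=/dev/zero of=/lustre/work/admin/ost3.tmp bs=1M count=4096 oflag=direct
    dd if=/lustre/work/admin/ost3.tmp of=/dev/null bs=1M iflag=direct
    rm /lustre/work/admin/ost3.tmp

    # Inspect the striping of a user file before discussing better striping parameters
    lfs getstripe /lustre/work/someuser/large_output.dat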

Complexity of parallel file system solutions (1)
- Complexity of the underlying storage
  - Lots of hardware components
    - Cables, adapters, memory, caches, controllers, batteries, switches, disks
    - All can break
  - Firmware or drivers might fail
- Extreme performance causes problems not seen elsewhere
  - Disks fail frequently
  - Timing issues cause failures
- [Diagram: clients connected via 4X DDR InfiniBand to MGS, MDS and OSS servers, which attach to the storage via 4 Gbit FC and 3 Gbit SAS]

Complexity of parallel file system solutions (2)
- Complexity of the parallel file system (PFS) software
  - Complex operating system interface
  - Complex communication layer
  - Distributed system: components on different systems are involved
    - Recovery after failures is complicated
    - Not easy to find out which component is causing trouble
  - Scalability: 1000s of clients use it concurrently
  - Performance: a low-level implementation is required
    - Higher-level solutions lose performance
- Expect bugs in any PFS software
  - Vendor tests at scale are very important
  - Lots of similar installations are beneficial

Experiences with storage hardware
- HP arrays hang after a disk failure under high load
  - Happened at different sites for years
  - System stalls, i.e. no file system check is required
- Data corruption at bwgrid sites with HP MSA2000
  - Firmware of the FC switch and of the MSA2000 is the most likely reason
  - Largely fixed by an HP action plan with firmware / software upgrades
- Data corruption with transtec provigo 610 RAID systems
  - A file system stress test on XFS causes the RAID system to hang
  - Problem is still under investigation by Infortrend
- SCSI errors and OST failures with DDN S2A9900
  - Caused by a single disk with media errors
  - Happened twice; new firmware provides better bad disk removal
- Expect severe problems with midrange storage systems

Interesting OSS storage option
- OSS configuration details (see the sketch below)
  - Linux software RAID6 over RAID systems
  - The RAID systems have hardware RAID6 over the disks
  - The RAID systems have one partition for each OSS
- No single point of failure
  - Survives 2 broken RAID systems
  - Survives 8 broken disks
- Good solution with single RAID controllers
  - The mirrored write cache of dual controllers is often the bottleneck
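
A minimal sketch of how such an OST device could be assembled, assuming each external RAID system exports one partition (here /dev/sdb to /dev/sdk) to this OSS; device names, chunk size and the Lustre formatting parameters are illustrative only:

    # Linux software RAID6 across the partitions exported by the RAID systems
    mdadm --create /dev/md0 --level=6 --raid-devices=10 --chunk=128 /dev/sd[b-k]

    # Format the resulting device as an OST (fsname and MGS NID are examples)
    mkfs.lustre --ost --fsname=work --mgsnode=10.1.0.1@o2ib /dev/md0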

Future requirements for PFS / Lustre
- Need better storage subsystems
- Fight against silent data corruption
  - It really happens
  - Finding the responsible component is a challenge
  - Checksums quickly show data corruption (see the sketch below)
    - and provide an increased chance of avoiding large-scale data corruption
  - Storage subsystems should also check data integrity
    - e.g. by checking the RAID parity during read operations
- Support efficient backup and restore
  - Need point-in-time copies of the data at a different location
  - Fast data paths for backup and restore are required
  - Checkpoints and differential backups might help
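
A minimal illustration of how application- or admin-level checksums can reveal silent data corruption, e.g. when comparing the primary copy with a backup copy; the paths are examples:

    # Record checksums once, right after the data set has been written
    find /lustre/data/project -type f -print0 | xargs -0 sha256sum > /lustre/data/project.sha256

    # Verify later or on the second copy; any mismatch points to a corrupted file
    sha256sum -c /lustre/data/project.sha256 | grep -v ': OK$'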

Sharing data (1): Extended InfiniBand fabric
- Examples:
  - IC1 and HC3
  - bwgrid clusters in Heidelberg and Mannheim (28 km distance)
    - InfiniBand coupled with Obsidian Longbow over DWDM
- Requirements:
  - Select an appropriate InfiniBand routing mechanism and cabling
  - Host-based subnet managers might be required (see the sketch below)
- Advantages:
  - Same file system visible and usable on multiple clusters
  - Normal Lustre setup without LNET routers
  - Low performance impact
- Disadvantages:
  - InfiniBand possibly less stable
  - More clients possibly cause additional problems
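
A minimal sketch of running a host-based subnet manager with an explicitly selected routing engine, assuming OpenSM; the routing engine and priority are examples and have to match the actual coupled fabric:

    # /etc/opensm/opensm.conf (excerpt)
    routing_engine ftree
    sm_priority 14

    # or start OpenSM directly with the same settings
    opensm --routing_engine ftree --priority 14 --daemon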

Sharing data (2): Grid protocols
- Example: bwgrid
  - gridftp and rsync over gsissh to copy data between clusters (see the sketch below)
- Requirements:
  - Grid middleware installation
- Advantages:
  - Clusters remain usable during external network or file system problems
  - Metadata performance is not shared between clusters
  - User ID unification is not required
  - No full data access for remote root users
- Disadvantages:
  - Users have to synchronize multiple copies of their data
  - Some users cannot cope with Grid certificates
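
A minimal sketch of both transfer paths, assuming valid Grid credentials (e.g. from grid-proxy-init) and GSI-enabled services on the remote cluster; host names and paths are examples:

    # GridFTP transfer with several parallel streams
    globus-url-copy -vb -p 4 file:///lustre/work/user/results.tar \
        gsiftp://bwgrid-remote.example.org/lustre/work/user/results.tar

    # Incremental synchronisation over GSI-enabled ssh
    rsync -av -e gsissh /lustre/work/user/project/ bwgrid-remote.example.org:/lustre/work/user/project/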

Further information
- Talks about Lustre administration, performance and usage: http://www.scc.kit.edu/produkte/lustre.php
- Contact: roland.laifer@kit.edu