Assistance in Lustre administration

Assistance in Lustre administration
Roland Laifer
Steinbuch Centre for Computing (SCC)
KIT - University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association
01.03.2011
www.kit.edu

Basic Lustre concepts

[Diagram: clients talk to the Metadata Server for directory operations, file open/close, metadata and concurrency, and to the Object Storage Servers for file I/O and file locking; recovery, file status and file creation are handled between MDS and OSS.]

Lustre components:
- Clients (C) offer a standard file system API
- Metadata Servers (MDS) hold metadata, e.g. directory data
- Object Storage Servers (OSS) hold file contents and store them on Object Storage Targets (OSTs)
- All components communicate efficiently over interconnects, e.g. with RDMA
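As a quick orientation on a running system, a client can list the components it is connected to. A minimal sketch, assuming the file system is mounted under /bwfs (adjust the path to the local mount point):

    # Show the MDT and all OSTs of the mounted file system, with usage
    lfs df -h /bwfs

    # List all configured Lustre devices on this node (MGC, MDC, OSCs, ...)
    # together with their state
    lctl dl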

bwgrid storage system (bwfs) concept

- Grid middleware for user access and data exchange

[Diagram: grid services, backup and archive, and gateway systems connect the participating sites (University 1 to University 7) over BelWü to the storage system.]

bwgrid storage system building blocks

[Diagram: gateway systems (GW) attached to BelWü; interconnects are Gbit Ethernet, 10 Gbit Ethernet, 4X DDR InfiniBand, 4 Gbit Fibre Channel and 3 Gbit SAS; servers are MGS, MDS and OSS pairs with MSA2000 storage.]

- Central file system backup: structure as shown in the diagram
- Central file system work: same structure with 3 OSS pairs and 4x4 enclosures per OSS pair
- Local file systems: same structure with 4x2 instead of 4x4 enclosures per OSS pair

bwgrid storage system hardware in detail

File system                128 TB central (backup)  256 TB central (work)   32 TB local (each site)     64 TB local (each site)
Location                   KIT CS                   KIT CS                  KIT CN, Freiburg,           Stuttgart, Ulm,
                                                                            Mannheim, Heidelberg        Tübingen
Metadata Server (MDS):
  # of servers             2                        2                       2                           2
  # of MSA2212fc           1                        1                       1                           1
  # of MSA2000 disk encl.  1                        1                       1                           1
  # of disks               24 * 146 GB SAS          24 * 146 GB SAS         24 * 146 GB SAS             24 * 146 GB SAS
Object Storage Server (OSS):
  # of servers             2                        6                       2                           2
  # of MSA2212fc           4                        12                      4                           4
  # of MSA2000 disk encl.  12                       36                      4                           4
  # of disks               192 * 1 TB SATA          576 * 1 TB SATA         96 * 750 GB SATA            96 * 1 TB SATA
Capacity                   128 TB                   256 TB                  32 TB                       64 TB
Throughput                 1500 MB/s                3500 MB/s               1500 MB/s                   1500 MB/s

Lustre administration challenges (1)

Very complicated hardware
- For example, see the building blocks slide
- For each component, a driver, firmware or subcomponent may fail
- Subcomponents are cables, adapters, caches, memory, ...
- With extreme performance, new hardware bugs show up; some are related to timing issues

Complex Lustre software
- Roughly 250,000 lines of code
- Lustre error messages are still hard to understand
- Focus on very high performance, e.g. requires low-level Linux kernel interfaces

Distributed system at large scale
- Not easy to find out which part is causing problems

Lustre administration challenges (2)

Inefficient user applications can cause trouble
- e.g. reading and writing to the same file area from many nodes
- Not easy to find out which user is causing the trouble

Importance of the file system
- Complete clusters are not usable if the file system hangs
- Corrupted or deleted user data is very annoying

This talk tries to help with most of these challenges

Best practices for MSA2000 storage systems

Also see the new documentation from HP: HP Scalable File Share G3 MSA2000fc How To

Firmware upgrades and broken controller replacement
- Either unmount clients and stop all servers
  - Requires full maintenance and waiting for jobs to complete
- Or shut down the affected server pairs, which causes I/O to hang
  - Risk of some application I/O errors and follow-on problems
  - Might cause job aborts due to the batch system wall clock limit
- Disable automatic partner firmware upgrade, i.e. upgrade each controller separately

Multiple broken disks per enclosure
- Up to 2 disks can be exchanged at the same time
- Contact support with 3 or more broken disks per enclosure
- Enable email alerts

Check system status

Lustre status
- Check for LustreError messages in the logs of servers and clients:
    pdsh -a 'grep LustreError: /var/log/messages'
  - Without LustreError messages Lustre usually works fine
  - LustreError messages require further investigation, see the next slides
- Check if the connections on all clients show status FULL:
    pdsh -a 'cat /proc/fs/lustre/*/*/*_server_uuid' | dshbak -c

Overall system health
- Use our script z20-hpsfs-check-health
  - Requires a clean system status to create proper reference files
  - Understanding the output needs some experience

Performance checks
- e.g. with dd on each OST LUN, see our upgrade documentation (a sketch follows below)
- Should be done before and after each maintenance
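A minimal sketch of such a dd check for one OST LUN, run on the OSS that serves it; the multipath device name is only an example and must be replaced with the real device from the local configuration:

    # Read 1 GB directly from the OST LUN and report the throughput;
    # only read from the device, never write to it while Lustre data is on it
    dd if=/dev/mapper/mpath0 of=/dev/null bs=1M count=1024 iflag=direct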

Understanding Lustre messages (1)

LBUG means Lustre bug and indicates a software bug
- Should be reported and can be searched on bugzilla.lustre.org

The string "evict" means aborted communication after a timeout
- The message on the server shows the client IP address and a timestamp
  - Use the batch system history to identify the user job(s) (see the sketch below)
- One possible reason is a hardware failure, e.g. of an InfiniBand adapter
- Another possible reason is inefficient applications
  - Timeouts due to lots of conflicting requests from many clients, e.g. caused by many tasks writing to the same file area

"Remounting filesystem read-only" could indicate a fatal failure
- Usually due to a storage subsystem problem
- Also check for SCSI error messages
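To get an overview of recent evictions across the whole system, the messages can be collected from all servers in one step; a minimal sketch, assuming pdsh access and logs in /var/log/messages as above:

    # Collect eviction messages with timestamps from all servers; the client
    # addresses can then be matched against the batch system history
    pdsh -a 'grep -i evict /var/log/messages' | dshbak -c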

Understanding Lustre messages (2)

Displayed error codes are standard Linux codes
- Show an explanation of their meaning:
    grep -hw <error number> /usr/include/*asm*/errno*.h
- e.g. return code -122 means "Quota exceeded"

Identify clients by the UUID shown in the logs
- Example of such a log entry:
    Feb 21 14:35:08 pfs1n10 LustreError: timeout on bulk PUT o3->7fc2aa1f-6d70-21c3-4df3-fee6cf6676d1
- Find out the client IP address on the corresponding server:
    pfs1n10# /usr/sbin/lctl get_param "*.*.exports.*.uuid" | \
      grep 7fc2aa1f-6d70-21c3-4df3-fee6cf6676d1
- If the logging server is not obvious, the lookup can also be run on all servers (see the sketch below)
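A minimal sketch of that system-wide lookup, using the UUID from the example above:

    # Search all servers for the export that belongs to the client UUID
    pdsh -a 'lctl get_param "*.*.exports.*.uuid" 2>/dev/null | grep 7fc2aa1f-6d70-21c3-4df3-fee6cf6676d1'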

Monitoring user activity

Use collectl to monitor I/O usage
- Installed on servers, installation on clients required
  http://collectl.sourceforge.net/
- Show read/write rates on clients or servers:
    pdsh -a 'collectl -sl -i1 -c5'
- Show metadata rates on a client:
    collectl -sl --lustopts M

Use Lustre statistics to find clients with high I/O usage
- Useful script to execute on each server (Lustre >= 1.8.2 only!), see also the sketch below:
  https://bugzilla.lustre.org/attachment.cgi?id=29248

No easy way to check which user on a client is doing I/O
- Sorting open files by update time might give hints:
    ls -lrht `lsof | grep /bwfs/ |
      perl -e 'while (<STDIN>) { if (m|(/bwfs/\S*)|) { if (-f $1) { $all .= " $1"; } } } print "$all\n";'`
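Where the script linked above is not available, the per-export statistics kept on the OSS servers can give a rough picture of which clients are generating load; a minimal sketch, assuming Lustre 1.8 style parameter names (obdfilter.*.exports.*.stats). Run it twice a few seconds apart and compare the counters:

    # Print read/write byte counters per client export on all OSS servers
    pdsh -a 'lctl get_param "obdfilter.*.exports.*.stats" 2>/dev/null | grep -E "read_bytes|write_bytes"' | dshbak -c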

Investigate reason for hanging file system (1)

Check if the client shows all services (OSTs and MDT)
    lfs df

Check if the logs report frequent problems for a service
    pdsh -a 'grep LustreError: /var/log/messages'

Check if servers are unhealthy or recovering, and check their uptime
    pdsh -a 'cat /proc/fs/lustre/health_check'
    pdsh -a 'lctl get_param "*.*.recovery_status"' | grep status:
    pdsh -a uptime

Check if all Lustre services are still mounted normally
    pdsh -a 'mount | grep lustre | wc -l' | dshbak -c
    pdsh -a 'grep read-only /var/log/messages'

These checks can be bundled into a small script (see the sketch below)
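A minimal sketch of such a wrapper, so the checks are always run in the same order during an incident; the script name is an assumption, adjust paths and pdsh options to the local setup:

    #!/bin/bash
    # lustre_hang_check.sh - run the standard checks for a hanging file system

    echo "== Services visible on this client =="
    lfs df

    echo "== Recent LustreError messages per server =="
    pdsh -a 'grep LustreError: /var/log/messages | tail -3' | dshbak -c

    echo "== Health, recovery status and uptime of all servers =="
    pdsh -a 'cat /proc/fs/lustre/health_check' | dshbak -c
    pdsh -a 'lctl get_param "*.*.recovery_status"' | grep status:
    pdsh -a uptime

    echo "== Mounted Lustre services and read-only remounts =="
    pdsh -a 'mount | grep lustre | wc -l' | dshbak -c
    pdsh -a 'grep read-only /var/log/messages' | dshbak -c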

Investigate reason for hanging file system (2)

Identify the server and mount point for the affected service
    pdsh -a 'lctl get_param "*.*.mntdev"'
- Also compare /opt/hp/sfs/scripts/filesystem.csv

Find out the affected storage device
    VOLID=`pdsh -w server 'grep mpathnum /var/lib/multipath/bindings' | \
      perl -ne 'if (/(\w{16})\s*$/) {print "$1\n"}'`
    for msa in <all MSA systems>; do msa2000cmd.pl $msa show volumes; done | grep -B1 $VOLID

Show critical and warning events on the affected MSA2000
    msa2000cmd.pl msa show events last 30 error
- Check the LEDs on the storage system

Do not forget to contact support at an early stage
- Carefully plan each step to repair the problem

Further information

Detailed step-by-step information for upgrades
- bwrepo.bfg.uni-freiburg.de, below ~unikarlsruhe/repo/g3.2_upgrade

More talks like this
- http://www.scc.kit.edu/produkte/lustre
- e.g. "Using file systems at HC3" for hints on using Lustre efficiently

HP SFS documentation
- http://www.hp.com/go/sfs-docs

Lustre in general
- http://www.lustre.org/

Lustre future
- http://www.opensfs.org/
- http://www.whamcloud.com/