Tips and Tricks for Diagnosing Lustre Problems on Cray Systems
Cory Spitz and Ann Koehler, Cray Inc.
5/25/2011

Introduction
- Lustre is a critical system resource, so problems need to be diagnosed quickly.
- Administrators and operators need to collect the necessary debug data the first time a problem arises:
  - Can't interrupt the production workload to investigate problems.
  - Can't depend on Cray to reproduce them.
- Performance is important too; Cray systems are supercomputers, after all.
- The paper and this talk cover a broad range of topics; the paper covers much more.
5/25/2011 Cray Inc. Proprietary Slide 2

Agenda
- Working with console logs and the syslog
  - How to read console logs
  - What to look for
  - Common failures
- Collecting additional debug data
- Performance
5/25/2011 Cray Inc. Proprietary Slide 3

Working with console logs and the syslog
- Console logs are the go-to resource for Lustre problems.
- But Lustre logs can be overwhelming, especially at scale:
  - Lustre is chatty ;)
  - Sequential messages from a single node can span pages.
  - printk rate limiting can only go so far.
- Use lustrelogs.sh to separate server console logs (see the sketch after this slide):
  - Writes out per-server logs with unambiguous names, even if failover occurred.
  - E.g. oss1.c1-0c0s0n3.nid00131.ost0001
  - Appendix A of the paper.
- Minor printk-level messages go to the syslog.
5/25/2011 Cray Inc. Proprietary Slide 4
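
The script itself is in Appendix A of the paper. A minimal stand-in for the idea, assuming each console line carries the node's bracketed cname as its first field (the input file name is illustrative):

    # Split a combined console log into per-node files keyed by cname.
    # Assumes lines look like "[c1-0c0s0n3] <message>"; the real
    # lustrelogs.sh also maps cnames to nids and Lustre target names.
    awk -F'[][]' 'NF > 1 { print > ("console." $2 ".log") }' console.log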

How to read console logs
- Understand node references:
  - Lustre identifies endpoints based on their LNET names, <#>@<lnet>; e.g. 100@gni identifies nid00100.
  - Console logs are prefixed with the Cray cname.
  - Cross-reference names via xtprocadmin or /etc/hosts.
  - For esfs, the IPoIB address identifies an external node, but IPoIB is not used for transport.
- Understand error codes:
  - Lustre uses standard Linux/POSIX errno values, e.g. "the ost_write operation failed with -30".
  - Keep errno-base.h and errno.h from /usr/include/asm-generic handy (see the lookup example after this slide).
5/25/2011 Cray Inc. Proprietary Slide 5
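
For example, the -30 above can be decoded straight from the headers, and a nid reference can be mapped to its node via /etc/hosts (the nid below is illustrative):

    # Decode errno 30 (drop the sign for the header lookup).
    grep -w 30 /usr/include/asm-generic/errno-base.h
    #   #define EROFS 30  /* Read-only file system */
    # Map an LNET nid to its hostname/cname entry.
    grep -w nid00100 /etc/hosts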

What to look for in console logs
- Identify major (node) faults:
  - LBUG (Cray enables panic_on_lbug by default)
  - ASSERT
  - Oops
  - Call Trace (not necessarily fatal; a stack trace is emitted)
- Also look for client-specific problems: evict, suspect, admindown.
- Ensure a sane configuration and proper startup:
  - `lustre_control.sh <fsname>.fs_defs verify_config`
  - boot:/tmp/lustre_control.<id>
- A quick scan for these signatures is sketched after this slide.
5/25/2011 Cray Inc. Proprietary Slide 6
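
A quick first pass over a per-server log, using the fault signatures above (the file name is the illustrative example from the earlier slide):

    # Flag major faults and evictions with line numbers for follow-up.
    egrep -n 'LBUG|ASSERT|Oops|Call Trace|evict' oss1.c1-0c0s0n3.nid00131.ost0001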

Client eviction
- Eviction results in lost data, but clients can stay up.
  - They can even pass the NHC file system test.
  - Users may characterize this as corruption.
- Common causes:
  - Client fails to ping the server within 1.5x obd_timeout.
  - Client fails to handle a blocking lock callback within ldlm_timeout.
  - Failed or flaky router or routes: clients resend RPCs, but servers do not resend lock callbacks.
- The timeouts involved can be checked as shown after this slide.
5/25/2011 Cray Inc. Proprietary Slide 7
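
A quick way to check those timeouts on a given node; this assumes the usual /proc/sys/lustre locations, which may vary by release:

    # obd_timeout (clients must ping within 1.5x this value) and
    # ldlm_timeout (deadline for blocking lock callbacks on servers).
    cat /proc/sys/lustre/timeout /proc/sys/lustre/ldlm_timeout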

Client eviction examples
Client side:
  LustreError: 11-0: an error occurred while communicating with 135@ptl. The ldlm_enqueue operation failed with -107
  LustreError: 167-0: This client was evicted by test-mdt0000; in progress operations using this service will fail.
Server side:
  Lustre: MGS: haven't heard from client 73c68998-6ada-5df5-fa9a-9cbbe5c46866 (at 7@ptl) in 679 seconds. I think it's dead, and I am evicting it.
Or:
  LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 603s: evicting client at 415@ptl ns: mds-test-mdt0000_uuid lock: ffff88007018b800/0x6491052209158906 lrc: 3/0,0 mode: CR/CR res: 4348859/3527105419 bits 0x3 rrc: 5 type: IBT flags: 0x4000020 remote: 0x6ca282feb4c7392 expref: 13 pid: 11168 timeout: 4296831002
5/25/2011 Cray Inc. Proprietary Slide 8

Client eviction examples
Client side:
  LustreError: 167-0: This client was evicted by lustrefs-OST0002; in progress operations using this service will fail.
Server side:
  LustreError: 138-a: lustrefs-ost0002: A client on nid 171@gni was evicted due to a lock blocking callback to 171@gni timed out: rc -4
And:
  LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 105s: evicting client at 171@gni ns: filter-lustrefs-ost0002_uuid lock: ffff8803c11a8000/0x69ba7544a5270d3d lrc: 4/0,0 mode: PR/PR res: 136687655/0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x10020 remote: 0x59d12fa603479bf2 expref: 21 pid: 8567 timeout 4299954934
5/25/2011 Cray Inc. Proprietary Slide 9

Gemini HW errors and resiliency features
- Stack reset upon critical HW errors:
  - Gather critical errors via `xthwerrlog -c crit -f <file>`.
  - The NIC is reset; gnilnd pauses all transfers and re-establishes connections.
  - Mechanism to ensure no lagging RDMA: the n_mdd_held field in /proc/kgnilnd/stats.
  - errno -131, ENOTRECOVERABLE, for gnilnd, but Lustre can recover.
- Quiesce and reroute for failed links:
  - at_min and ldlm_timeout tuned up to 70s.
- Appendix C in the paper describes gnilnd codes and their meanings.
- A quick post-reset check is sketched after this slide.
- Example console messages:
  LNet: critical hardware error: resetting all resources (count 1)
  LNet:3980:0:(gnilnd.c:645:kgnilnd_complete_closed_conn()) Closed conn 0xffff880614068800->0@gni (errno -131): canceled 1 TX, 0/0 RDMA
  LNet: critical hardware error: All threads awake!
  LNet: successful reset of all hardware resources
5/25/2011 Cray Inc. Proprietary Slide 10
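
After a stack reset, the items named above can be checked directly. The console-log path is illustrative, and this assumes the n_mdd_held field appears literally in the stats output:

    # Confirm the gnilnd reset messages appeared, then check for lagging RDMA.
    grep 'critical hardware error' /path/to/console.log
    grep n_mdd_held /proc/kgnilnd/stats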

Collecting debug kernel traces (dk log)
- See the Lustre Operations Manual, Chapter 24, Lustre debugging.
- Turn on full debug: `lctl set_param debug=-1`
- Increase the size of the ring buffer: `lctl set_param debug_mb=400`
- Start fresh: `lctl clear`
- Annotate: `lctl mark <annotation>`
- Collect (the trailing 1 is not a typo; it selects fast binary mode): `lctl dk <file> 1`
- Or enable dump_on_timeout or dump_on_eviction.
- Convert the dk log to a human-readable format and time scale with sort_lctl.sh and lctl_daytime.sh in Appendix B of the paper.
- The whole sequence is collected in the sketch after this slide.
5/25/2011 Cray Inc. Proprietary Slide 11
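
Putting the steps above together, a typical collection run on the affected node might look like this (the mark text and output path are illustrative):

    lctl set_param debug=-1          # full debug
    lctl set_param debug_mb=400      # enlarge the debug ring buffer
    lctl clear                       # start with an empty buffer
    lctl mark "begin reproduction"   # annotate the trace
    # ... reproduce the problem here ...
    lctl mark "end reproduction"
    lctl dk /tmp/dk.$(hostname) 1    # trailing 1 selects fast binary mode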

Additional debug data
- There is a wealth of data in the Lustre /proc interfaces; almost everything is documented in the Lustre Operations Manual.
- Watch the clients' import files:
  - Shows connection status, RPC state counts, and service estimates.
  - `lctl get_param *.*.import` (example on the next slide)
- Client- and server-side stats:
  - `lctl get_param *.*.stats` or `llstat`
  - Shows counts, min and max time in μsecs, sum, and sum squared.
- Recovery status on servers: `lctl get_param *.*.recovery_status`
- LMT or llobdstat for real-time monitoring; a simple snapshot sketch follows this slide.
5/25/2011 Cray Inc. Proprietary Slide 12
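
One simple way to preserve these interfaces for later comparison; the output paths are illustrative, and recovery_status exists only on servers, hence the redirect:

    # Snapshot import, stats, and recovery state into one directory.
    d=/tmp/lustre-debug.$(date +%s); mkdir -p "$d"
    lctl get_param '*.*.import'          > "$d/import.txt"
    lctl get_param '*.*.stats'           > "$d/stats.txt"
    lctl get_param '*.*.recovery_status' > "$d/recovery_status.txt" 2>/dev/null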

Client import file
import:
    name: lustrefs-ost0001-osc-ffff8803fd227400
    target: lustrefs-ost0001_uuid
    state: FULL
    connect_flags: [write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, version_recovery]
    import_flags: [replayable, pingable]
    connection:
        failover_nids: [26@gni, 137@gni]
        current_connection: 26@gni
        connection_attempts: 1
        generation: 1
        in-progress_invalidations: 0
    rpcs:
        inflight: 0
        unregistering: 0
        timeouts: 0
        avg_waittime: 24121 usec
    service_estimates:
        services: 70 sec
        network: 70 sec
    transactions:
        last_replay: 0
        peer_committed: 403726926456
        last_checked: 403726926456
    read_data_averages:
        bytes_per_rpc: 1028364
        usec_per_rpc: 41661
        MB_per_sec: 24.68
    write_data_averages:
        bytes_per_rpc: 1044982
        usec_per_rpc: 21721
        MB_per_sec: 48.10
5/25/2011 Cray Inc. Proprietary Slide 13

Metadata performance
- Metadata performance is one of the biggest complaints about Lustre, usually voiced as the result of interactive usage.
- Clients are limited to a single modifying metadata operation in flight:
  - The only way to get more modifying ops in flight is to add more nodes.
  - The max_rpcs_in_flight parameter applies to non-modifying ops; tune it up on interactive login nodes (see the sketch after this slide).
- Users tend to make it worse:
  - `ls -l` is expensive on Lustre; be careful, ls is aliased to `ls --color=tty`.
  - Really, it is stat() that is expensive.
  - Use `lfs check servers` instead of `/bin/df`.
5/25/2011 Cray Inc. Proprietary Slide 14
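
A hedged example of the login-node tuning and the cheaper interactive habits mentioned above; the mdc parameter path and the value of 16 are assumptions to adapt per site:

    # Allow more concurrent non-modifying metadata RPCs from this client.
    lctl set_param mdc.*.max_rpcs_in_flight=16
    # Cheaper alternatives for interactive use:
    lfs check servers    # instead of /bin/df
    \ls                  # bypass the --color=tty alias and its extra stat()s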

Bulk read/write performance
- Client side: `lctl get_param osc.*.rpc_stats`
- Server side: `lctl get_param obdfilter.*.brw_stats`
  - I/O times are reported.
  - Look for 1 MiB writes all the way through to disk:
    - Avoid read-modify-write in the HW RAID controller.
    - And/or avoid cache mirroring, depending on RAID type.
  - Use sd_iostats data to see the effect of fs metadata (e.g. journals).
- Suboptimal I/O is not an error:
  - Could be silent errors (sector remapping, etc.).
  - Could be a RAID rebuild.
- Per-client stats: `lctl get_param obdfilter.*.exports.*.brw_stats`
- OSS read cache: O_DIRECT cache semantics; `lctl set_param obdfilter.*.readcache_max_filesize=32m`
- A collection workflow is sketched after this slide.
5/25/2011 Cray Inc. Proprietary Slide 15
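
One way to bracket a test run with these counters. Clearing the histograms by writing to the stats files is common practice but is an assumption here, as is the readcache value:

    # Reset the histograms, run the workload, then collect both views.
    lctl set_param osc.*.rpc_stats=0
    lctl set_param obdfilter.*.brw_stats=0
    # ... run the I/O test ...
    lctl get_param osc.*.rpc_stats                   # client: RPC size histogram
    lctl get_param obdfilter.*.brw_stats             # server: I/O sizes and times
    lctl get_param obdfilter.*.exports.*.brw_stats   # per-client breakdown
    # Keep large streaming files out of the OSS read cache, as above:
    lctl set_param obdfilter.*.readcache_max_filesize=32m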

LNET performance
- Credits are key:
  - Network Interface (NI) transmit (tx) credits: the maximum number of concurrent sends for the LNET.
  - Peer tx credits: the number of concurrent sends to a single peer.
- Credits are like semaphores:
  - NI and peer tx credits must both be acquired to send to a remote peer.
  - If a credit isn't available, the send is queued.
- Monitor credit use (see the sampler after this slide):
  - /proc/sys/lnet/nis
  - /proc/sys/lnet/peers
  - Negative numbers indicate queued sends.
  - The min column shows the low-water mark; if min goes negative during normal operation, consider tuning credits.
5/25/2011 Cray Inc. Proprietary Slide 16
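
A minimal sampler for the duration of a run, using the /proc files above; the interval and output path are illustrative:

    # Record credit usage once a minute; afterwards, look for negative
    # min values as a sign that sends had to queue waiting for credits.
    while sleep 60; do
        date
        cat /proc/sys/lnet/nis /proc/sys/lnet/peers
    done >> /tmp/lnet-credits.log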

LNET router performance
- Two more credits need to be acquired for router tx:
  - Router buffer credits: router buffers hold bulk data for network bridging (RDMA).
    - Less than a page: tiny_router_buffers
    - Page sized: small_router_buffers
    - 1 MiB: large_router_buffers
  - Peer router buffer credits: the number of router buffers used for a single peer.
    - Defaults to the LND peer tx credit count.
    - Consider tuning the lnet.ko module parameter peer_buffer_credits (see the sketch after this slide).
- Monitor router credits:
  - /proc/sys/lnet/buffers
  - /proc/sys/lnet/peers
  - Again, negative numbers indicate queued operations.
5/25/2011 Cray Inc. Proprietary Slide 17
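
A hedged example of the router-side tuning; the modprobe configuration path and the value of 128 are assumptions, and the setting only takes effect when the lnet module is reloaded:

    # On an LNET router node: raise the per-peer router buffer credits.
    echo 'options lnet peer_buffer_credits=128' >> /etc/modprobe.d/lustre.conf
    # Then watch buffer usage alongside the peer credits.
    cat /proc/sys/lnet/buffers /proc/sys/lnet/peers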

Questions?
5/25/2011 Cray Inc. Proprietary Slide 18