LLNL Lustre Centre of Excellence
Mark Gary, 4/23/07

This work was performed under the auspices of the U.S. Department of Energy by University of California, Lawrence Livermore National Laboratory under Contract W-7405-Eng-48. UCRL-PRES-230018
LLNL is home to a Lustre Centre of Excellence (LCE)

- We enjoy a close working partnership with CFS.
- The Lustre Centre of Excellence (LCE) is written into our ongoing CFS support contract.
- I consider almost everything we do with Lustre, contractual or not, to be an LCE effort item.
- LCE activities at LLNL are many.
LLNL LCE Effort Areas

Selected CFS/LLNL efforts:
- At-scale testing, bug fixing, performance issue analysis
- fsck: debugging/fixing, acceleration
- Metadata speed-up
- Adaptive timeouts
- Lustre free space management

LLNL development efforts:
- ZFS prototype
- Failover implementation
- Lustre Monitoring Tool 2 (LMT2)

Tri-Lab PathForward efforts
At-scale testing, bug fixing and analysis

- We operate a very large test environment for use by ourselves and CFS.
- We run around-the-clock at-scale testing of all of our releases.
- Scheduled dedicated testing by CFS benefits the entire community.
- As in other areas, our scale regularly reveals bugs and performance issues that don't show up in small-scale tests:
  - We are constantly working with CFS on issues revealed at scale.
  - LLNL's top-10 bugs are prioritized each week.
  - Weekly meetings with CFS review progress and plans.
At-scale Lustre test resource: ALC-ltest

[Diagram: the ALC-ltest test cluster. 430 compute nodes and 16 GW nodes (dual 2.4 GHz Xeons) on a Quadrics Elan3 interconnect, with GigE links to an MDS and 23 OSS nodes (dual 2.8 GHz Xeons); the OSS nodes attach over DDN Fibre Channel to 143 TB of DDN storage (IGS storage cluster).]
Fsck improvements

Improvements include:
- Fixing a segfault due to corrupt extent headers
- Fixing a segfault on extended-attribute corruption
- Improving e2fsck heuristics for detecting corrupted inodes
- Shared block resolution: implementing an alternative to cloning
- Fixes for Coverity-detected bugs

Speed-up milestone:
- Halve the time for fscks
- Based on looking at only active inodes (keeping track of an inode allocation high-water mark); a sketch of the idea follows.
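Below is a minimal Python sketch, not the e2fsck code, of the high-water-mark idea behind the speed-up milestone: a check only walks inode slots up to the highest slot ever allocated, so largely empty inode tables are scanned in a fraction of the time. The function name and figures are illustrative assumptions.

def inodes_to_check(group_inode_count, high_water_mark):
    """Return the inode slots a scan must visit in one block group.

    group_inode_count: total inode slots in the group (fixed at mkfs time)
    high_water_mark:   highest slot ever allocated, tracked by the filesystem
    """
    # Slots past the high-water mark have never held an inode, so a
    # consistency check can skip them entirely.
    return range(min(high_water_mark, group_inode_count))

# Example: a group with 32768 slots but only 1200 ever used is scanned
# in roughly 4% of the time a full-table pass would take.
print(len(inodes_to_check(32768, 1200)))  # -> 1200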
Metadata speedup

Goals:
- Cut "ls -l" time by 50%
- Cut "rm -r" time by 75%
- Improve performance on the LRU create test by 70%

Achieved by:
- Client-side readahead for the MDS (directory contents and parallel fetching of attributes); see the sketch below
- Dynamic sizing and automatic tuning (client-based lock timeout) of the client LRU (lock) list
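The following Python sketch illustrates why parallel attribute fetching helps an "ls -l" workload: the per-entry stat round trips overlap instead of being issued one at a time. It is a local illustration of the concept only, not CFS's client-side readahead code.

import os
from concurrent.futures import ThreadPoolExecutor

def parallel_listing(directory, workers=8):
    """Return (name, stat) pairs, fetching attributes concurrently."""
    names = os.listdir(directory)   # one readdir pass for the directory contents
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Attribute lookups are issued in parallel rather than serially.
        stats = pool.map(lambda n: os.lstat(os.path.join(directory, n)), names)
    return list(zip(names, stats))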
Adaptive timeouts

- Static timeouts used by callers of Lustre RPCs cause difficulties under unusual load.
- CFS is modifying RPC calls and other Lustre components to respond dynamically to RPC delays.
- The aim is to make all Lustre timeouts sensitive to recent completion times, with feedback; a sketch of the idea follows.
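A minimal sketch of the adaptive-timeout idea, assuming a simple exponentially weighted moving average of recent completion times. The smoothing factor, margin, and floor are illustrative choices, not values from the Lustre implementation.

class AdaptiveTimeout:
    def __init__(self, initial=100.0, alpha=0.25, margin=2.0, floor=10.0):
        self.estimate = initial   # smoothed service-time estimate (seconds)
        self.alpha = alpha        # weight given to the newest sample
        self.margin = margin      # headroom multiplier over the estimate
        self.floor = floor        # never drop below this timeout

    def observe(self, completion_time):
        """Feed back the measured completion time of a finished RPC."""
        self.estimate = (1 - self.alpha) * self.estimate + self.alpha * completion_time

    def next_timeout(self):
        """Timeout to use for the next RPC, tracking recent server behavior."""
        return max(self.floor, self.margin * self.estimate)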
Free space management

Automate and enhance Lustre free space management:
- Detect full OSTs and adapt (see the placement sketch below)
- Automatic space balancing and migration
- Administrator-initiated space balancing
- Administrator-initiated full migration of OSTs
- Administrator-initiated online defragmentation of OSTs
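A hypothetical sketch of space-aware placement: OSTs past a fullness threshold are excluded, and the remaining OSTs receive new objects in proportion to their free space. The threshold and weighting are assumptions for illustration, not Lustre's allocator.

import random

def choose_ost(osts, full_threshold=0.95):
    """osts: list of (name, used_bytes, total_bytes) tuples."""
    candidates = [(name, total - used)
                  for name, used, total in osts
                  if used / total < full_threshold]   # skip full OSTs
    if not candidates:
        raise RuntimeError("all OSTs are above the fullness threshold")
    names, free = zip(*candidates)
    # OSTs with more free space are proportionally more likely to be chosen.
    return random.choices(names, weights=free, k=1)[0]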
Lustre Monitoring Tools v2 (LMT2)

The second generation of the Lustre Monitoring Tools (LMT) uses a MySQL database backend for storing and retrieving Lustre information related to OSTs, the metadata servers, and the routers. As a result, LMT applications can analyze Lustre performance either in real time or over specified historical periods (an example query appears below).

There are currently three LMT2 apps in development:
- lstat: simple text display that operates like Unix netstat (v1.0 complete)
- ltop: curses-based tool that operates like Unix top (v1.0 complete)
- jwatch (working title): new GUI with extensive charting capabilities (v1.0 beta)

[Diagram: lmtd daemons on the MDS, OSTs, and LNET routers feed an lmtcollect aggregator on a management node, which stores samples in MySQL; LMT utilities on the desktop read from the database.]
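Because LMT2 keeps its samples in MySQL, any SQL client can pull either live or historical data. The sketch below assumes a hypothetical table and column layout (ost_data, read_bytes, and so on); the actual LMT2 schema may differ.

import pymysql

conn = pymysql.connect(host="mgmt-node", user="lmt", password="secret", database="lmt")
with conn.cursor() as cur:
    cur.execute(
        "SELECT ost_name, ts, read_bytes, write_bytes "
        "FROM ost_data "                        # hypothetical table name
        "WHERE ts > NOW() - INTERVAL 1 HOUR "   # last hour of samples
        "ORDER BY ts"
    )
    for ost_name, ts, read_bytes, write_bytes in cur.fetchall():
        print(ost_name, ts, read_bytes, write_bytes)
conn.close()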
LMT2 "top": ltop

- Multiple views: router, router group, filesystem, OST, OSS, MDS
- Low overhead
- Curses-based
The LMT2 GUI

Start with xwatch-lustre functionality, then add:
- New views (OSS, filesystem, router group, ...)
- Plotting capability (historical trends, heartbeat, ...)
- Customization features
- Full-system health at a glance
- Client display

A new graphical chart control is in development.
LMT2 Plans

- LMT 2.0 release [internal]
- LMT 2.0 release [external]
- Extend the database access class
- Add more views to the GUI and ltop
- Extend the new GUI to support historical and trending plots
- Release version LMT 2.1
- Collect OSS-specific data
- Add views for OSS-specific data in the LMT utilities
- Extend the new xwatch-lustre to include a global health view of Lustre
- Release version LMT 2.2
- Add support for viewing client data
- Release version LMT 2.3
Failover implementation

- Linux-HA based
- Initial implementation currently undergoing test
- Priority on fencing and prevention-of-data-loss requirements (a fencing-first sketch follows)
- Based upon Release 2 of the Linux-HA software (active development, testing, fixing)
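A minimal sketch, in Python pseudologic, of the fencing-first ordering the implementation prioritizes: the standby OSS refuses to mount a shared OST until the failed peer is confirmed powered off, so both nodes can never write the same disks. fence_node() and mount_ost() are hypothetical stand-ins for the real STONITH and mount steps, not Linux-HA interfaces.

def take_over(failed_node, ost_device, fence_node, mount_ost):
    """Bring an OST up on the surviving node only after the peer is fenced."""
    if not fence_node(failed_node):   # STONITH: confirm the peer is powered off
        raise RuntimeError("could not fence %s; refusing takeover" % failed_node)
    # Only after a confirmed fence is it safe to import the shared storage.
    mount_ost(ost_device)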
Failover

[Diagram: OSS nodes arranged in failover pairs, each pair sharing RAID controllers and disk drawers.]
ZFS prototype

- LLNL is launching a prototyping effort to investigate the viability of running OSTs atop Sun's ZFS file system.
- Our prototyping effort only goes as far as porting a portion of ZFS into the Linux kernel.
- Our goal is to learn the viability of the partial port and let the results guide any future work.
Lustre/ZFS motivation

EXT3 problems:
- Maximum OST filesystem size of 16-32 TB
- Offline fsck recovery time
- Data corruption goes unnoticed
- Crashes, corruption, fsck challenges and complexity

ZFS:
- Maximum OST filesystem size is unlimited by the file system
- Consistency checking is online
- Every block is checksummed (metadata and data)

Other ZFS benefits:
- Copy-on-write may result in more streaming I/O
- More redundancy options (RAIDZ2, metadata ditto blocks, ...)
- Administrative flexibility
- JBOD and other hardware options
Lustre/ZFS Integration Strategy

- Replace EXT3 on OSTs with ZFS
- Port only the ZFS Data Management Unit (DMU) and Storage Pool Allocator (SPA), not the ZFS POSIX Layer (ZPL)
- Requires fsfilt-to-DMU integration (an illustrative layering sketch follows)

[Diagram: the ZFS stack. A system call enters the ZFS POSIX Layer (ZPL) via VOP_MUMBLE()-style vnode operations, the ZPL addresses the Data Management Unit (DMU) with <dataset, object, offset>, the DMU hands the Storage Pool Allocator (SPA) a <data virtual address>, and the SPA resolves a <physical device, offset> for the device driver. The Lustre OST's <fsfilt> layer plugs in at the DMU level, since the ZPL is not ported.]
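The toy classes below sketch the layering the prototype targets, under the assumption that fsfilt addresses the DMU with <dataset, object, offset> tuples and the SPA resolves data virtual addresses to <physical device, offset> pairs. None of these classes are ZFS or Lustre interfaces; they only illustrate the address translation performed at each layer.

class SPA:
    """Storage Pool Allocator: data virtual address -> (device, offset)."""
    def resolve(self, dva):
        device, offset = dva
        return device, offset

class DMU:
    """Data Management Unit: (dataset, object, offset) -> data virtual address."""
    def __init__(self, spa, block_map):
        self.spa = spa
        self.block_map = block_map            # maps (dataset, obj, offset) -> dva

    def read(self, dataset, obj, offset):
        dva = self.block_map[(dataset, obj, offset)]
        return self.spa.resolve(dva)

class FsfiltDMU:
    """What the fsfilt-to-DMU glue would ask for on behalf of an OST."""
    def __init__(self, dmu):
        self.dmu = dmu

    def ost_read(self, obj, offset):
        return self.dmu.read("ost0", obj, offset)

# Tiny usage example with a one-entry block map:
spa = SPA()
dmu = DMU(spa, {("ost0", 42, 0): ("sdc", 1048576)})
print(FsfiltDMU(dmu).ost_read(42, 0))   # -> ('sdc', 1048576)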
Tri-Lab PathForward Efforts

Tri-Lab (LANL, SNL, LLNL)/HP/CFS efforts:
- Infiniband
  - Compute nodes speak only IB
  - I/O nodes translate to IP for 10GigE
  - Lustre storage sits on a 10GigE LAN
  (a routing sketch follows below)
- Clustered MDS
- Security

[Diagram: compute nodes connect to an IB switch; I/O nodes bridge from IB to 10 GigE.]
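A toy sketch of the routed topology on this slide: a compute node that has only an IB interface must forward traffic bound for the 10GigE storage LAN through an I/O (router) node. The network names and selection logic are illustrative assumptions, not LNET code.

# Routing table for a compute node: destination network -> gateway on the
# local IB network. "tcp0" stands for the 10GigE storage LAN here.
ROUTES = {"tcp0": "io-node-1@ib0"}

def next_hop(local_networks, dest_network):
    if dest_network in local_networks:
        return None                     # directly reachable, no router needed
    return ROUTES[dest_network]         # hop through an I/O node

# A compute node (IB only) reaching a 10GigE Lustre server goes via the
# I/O node, which sits on both networks and forwards the traffic.
print(next_hop({"ib0"}, "tcp0"))        # -> io-node-1@ib0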
Conclusion

The LLNL/CFS relationship is active and varied:
- At-scale testing, bug fixing, performance issues
- fsck improvements
- Metadata speed-up
- Adaptive timeouts
- Lustre free space management

LLNL is pursuing a number of development efforts:
- ZFS prototype
- Lustre Monitoring Tool 2 (LMT2)
- Failover implementation

The Tri-Labs, HP and CFS are working other areas.

The LCE is working and benefiting the entire Lustre community.