LLNL Lustre Centre of Excellence

Mark Gary
4/23/07

This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract W-7405-Eng-48. UCRL-PRES-230018

LLNL is home to a Lustre Centre of Excellence (LCE)
- We enjoy a close working partnership with CFS.
- The LCE is written into our ongoing CFS support contract.
- I consider almost everything we do with Lustre, contractual or not, to be an LCE effort item.
- LCE activities at LLNL are many.

LLNL LCE Effort Areas

Selected CFS/LLNL efforts:
- At-scale testing, bug fixing, performance issue analysis
- fsck: debugging/fixing, acceleration
- Metadata speed-up
- Adaptive timeouts
- Lustre free space management

LLNL development efforts:
- ZFS prototype
- Failover implementation
- Lustre Monitoring Tool 2 (LMT2)

Tri-Lab PathForward efforts

At-scale testing, bug fixing and analysis
- We operate a very large test environment for use by ourselves and CFS.
- We run around-the-clock, at-scale testing of all of our releases.
- Scheduled dedicated testing by CFS benefits the entire community.
- As in other areas, our scale regularly reveals bugs and performance issues that don't show up in small-scale tests:
  - We are constantly working with CFS on issues revealed at scale
  - LLNL's top-10 bugs prioritized each week
  - Weekly meeting with CFS to review progress and plans

At-scale Lustre test resource: ALC-ltest

[Cluster diagram]
- Quadrics Elan3 interconnect: 430 compute nodes, 16 GW nodes
- GigE network connecting the GW nodes to the Lustre servers
- MDS (dual 2.4 GHz Xeons)
- 23 OSS nodes (dual 2.8 GHz Xeons)
- 143 TB DDN IGS (storage cluster), attached over Fibre Channel

Fsck improvements
Improvements include:
- Fixing a segfault due to corrupt extent headers
- Fixing a segfault on extended-attribute corruption
- Improving e2fsck heuristics for detecting corrupted inodes
- Shared-block resolution: implement an alternative to cloning
- Fixes for Coverity-detected bugs

Speed-up milestone:
- Halve the time for fscks by looking at only active inodes (keeping track of an inode allocation high-water mark); see the sketch below.
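
A minimal sketch of the high-water-mark idea in C: if the filesystem records the highest inode number ever allocated, the consistency check only needs to walk inodes up to that mark. The fs_info fields and the check_inode() stub are hypothetical placeholders; the real e2fsck changes are considerably more involved.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical per-filesystem metadata; a real implementation would read
     * this from the superblock / group descriptors. */
    struct fs_info {
        uint64_t total_inodes;          /* size of the inode table          */
        uint64_t inode_alloc_highwater; /* highest inode number ever in use */
    };

    /* Stand-in for the real per-inode consistency checks. */
    static void check_inode(uint64_t ino) { (void)ino; }

    /* Pass-1-style scan that stops at the allocation high-water mark instead
     * of walking the entire inode table. */
    static void scan_inodes(const struct fs_info *fs)
    {
        uint64_t limit = fs->inode_alloc_highwater < fs->total_inodes
                             ? fs->inode_alloc_highwater
                             : fs->total_inodes;

        for (uint64_t ino = 1; ino <= limit; ino++)
            check_inode(ino);

        printf("checked %llu of %llu inodes\n",
               (unsigned long long)limit,
               (unsigned long long)fs->total_inodes);
    }

    int main(void)
    {
        struct fs_info fs = { .total_inodes = 1u << 20,
                              .inode_alloc_highwater = 150000 };
        scan_inodes(&fs);
        return 0;
    }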

Metadata speedup
Goals:
- Cut ls -l time by 50%
- Cut rm -r time by 75%
- Improve performance on the LRU create test by 70%

Achieved by:
- Client-side read-ahead for the MDS (directory contents and parallel fetching of attributes); see the sketch after this list
- Dynamic sizing and automatic tuning (client-based lock timeout) of the client LRU (lock) list
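
The parallel attribute fetching can be pictured with a small userspace analogue: while the main thread walks a directory, worker threads issue stat() calls concurrently, much as the client-side read-ahead overlaps getattr requests to the MDS. This is an illustration only, not the Lustre client code, and error handling is kept minimal (build with cc -pthread).

    #include <dirent.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>

    #define NWORKERS 8   /* number of attribute fetches kept in flight */

    static void *stat_worker(void *arg)
    {
        char *path = arg;
        struct stat st;

        if (stat(path, &st) == 0)
            printf("%10lld  %s\n", (long long)st.st_size, path);
        free(path);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        const char *dir = argc > 1 ? argv[1] : ".";
        DIR *dp = opendir(dir);
        if (!dp) { perror("opendir"); return 1; }

        pthread_t tids[NWORKERS];
        int inflight = 0;
        struct dirent *de;

        while ((de = readdir(dp)) != NULL) {
            char *path = malloc(strlen(dir) + strlen(de->d_name) + 2);
            sprintf(path, "%s/%s", dir, de->d_name);

            /* Fetch attributes in parallel rather than one stat() at a time. */
            pthread_create(&tids[inflight++], NULL, stat_worker, path);
            if (inflight == NWORKERS) {
                for (int i = 0; i < inflight; i++)
                    pthread_join(tids[i], NULL);
                inflight = 0;
            }
        }
        for (int i = 0; i < inflight; i++)
            pthread_join(tids[i], NULL);
        closedir(dp);
        return 0;
    }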

Adaptive timeouts
- Static timeouts used by callers of Lustre RPCs cause difficulties in unusual-load scenarios.
- CFS is modifying calls to RPCs and other Lustre components to dynamically respond to RPC delays.
- Make all Lustre timeouts sensitive to recent completion times and feedback, along the lines of the sketch below.
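
The slide does not spell out the algorithm, but a common way to make a timeout track observed latency is an exponentially weighted moving average of recent completion times plus a safety margin, roughly as below; the structure, constants, and bounds are illustrative assumptions rather than the CFS implementation.

    #include <stdio.h>

    /* Illustrative adaptive timeout: keep a smoothed estimate of recent RPC
     * completion times and derive the timeout from it, clamped to bounds. */
    struct adaptive_timeout {
        double avg_ms;   /* smoothed estimate of RPC service time */
        double min_ms;   /* never time out faster than this       */
        double max_ms;   /* never wait longer than this           */
    };

    static void at_observe(struct adaptive_timeout *at, double completion_ms)
    {
        const double alpha = 0.125;  /* weight given to the newest sample */
        at->avg_ms = (1.0 - alpha) * at->avg_ms + alpha * completion_ms;
    }

    static double at_timeout(const struct adaptive_timeout *at)
    {
        double t = 4.0 * at->avg_ms;    /* safety margin over the average */
        if (t < at->min_ms) t = at->min_ms;
        if (t > at->max_ms) t = at->max_ms;
        return t;
    }

    int main(void)
    {
        struct adaptive_timeout at = { .avg_ms = 100, .min_ms = 50, .max_ms = 60000 };
        double samples[] = { 80, 120, 2500, 2400, 150 };  /* load spike, then recovery */

        for (int i = 0; i < 5; i++) {
            at_observe(&at, samples[i]);
            printf("sample %.0f ms -> timeout %.0f ms\n", samples[i], at_timeout(&at));
        }
        return 0;
    }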

Free space management
Automate and enhance Lustre free space management:
- Detect full OSTs and adapt (see the allocation sketch after this list)
- Automatic space-balancing and migration
- Administrator-initiated space balancing
- Administrator-initiated full migration of OSTs
- Administrator-initiated on-line defragmentation of OSTs
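
As an illustration of "detect full OSTs and adapt", the sketch below chooses the OST for a new object with probability proportional to its free space and skips OSTs that have dropped below a free-space floor. It is not the actual Lustre allocator; the threshold and data layout are invented for the example.

    #include <stdio.h>
    #include <stdlib.h>

    struct ost {
        int       index;
        long long free_mb;
    };

    #define MIN_FREE_MB 1024   /* treat OSTs below this as full */

    /* Free-space-weighted selection: emptier OSTs are chosen more often,
     * full OSTs are skipped entirely. Returns -1 if every OST is full. */
    static int pick_ost(const struct ost *osts, int n)
    {
        long long total = 0;
        for (int i = 0; i < n; i++)
            if (osts[i].free_mb > MIN_FREE_MB)
                total += osts[i].free_mb;
        if (total == 0)
            return -1;

        long long r = (long long)(((double)rand() / RAND_MAX) * (total - 1));
        for (int i = 0; i < n; i++) {
            if (osts[i].free_mb <= MIN_FREE_MB)
                continue;
            if (r < osts[i].free_mb)
                return osts[i].index;
            r -= osts[i].free_mb;
        }
        return -1;   /* not reached */
    }

    int main(void)
    {
        struct ost osts[] = { {0, 500}, {1, 800000}, {2, 200000}, {3, 400000} };
        int counts[4] = {0, 0, 0, 0};

        for (int i = 0; i < 100000; i++) {
            int idx = pick_ost(osts, 4);
            if (idx >= 0)
                counts[idx]++;
        }
        for (int i = 0; i < 4; i++)
            printf("OST%d (%lld MB free): chosen %d times\n",
                   i, osts[i].free_mb, counts[i]);
        return 0;
    }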

Lustre Monitoring Tools v2 (LMT2)
The second generation of the Lustre Monitoring Tools (LMT) uses a MySQL database backend for storing and retrieving Lustre information related to OSTs, the metadata servers, and the routers. As a result, LMT applications can analyze Lustre performance either in real time or over specified historical periods.

There are currently three LMT2 apps in development:
- lstat: simple text display that operates like Unix netstat (v1.0 complete)
- ltop: curses-based tool that operates like Unix top (v1.0 complete)
- jwatch (working title): new GUI with extensive charting capabilities (v1.0 beta)

Architecture (from the slide diagram): lmtd daemons on the MDS, OSTs, and LNET routers report to an lmtcollect aggregator on the management node, which stores the data in MySQL; the LMT utilities on the desktop query that database (an illustrative query is sketched below).
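
For illustration, a utility in the lstat/ltop family could pull recent OST statistics out of the MySQL backend roughly as follows (MySQL C API, built against libmysqlclient). The database, table, and column names (lmt, ost_data, ost_name, kbytes_free, ts) and the connection details are hypothetical; the slide does not describe the actual LMT2 schema.

    #include <stdio.h>
    #include <mysql/mysql.h>

    int main(void)
    {
        MYSQL *conn = mysql_init(NULL);
        if (!mysql_real_connect(conn, "mgmt-node", "lmt", "password",
                                "lmt", 0, NULL, 0)) {
            fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
            return 1;
        }

        /* Most recent free-space samples (hypothetical schema). */
        if (mysql_query(conn,
                "SELECT ost_name, kbytes_free, ts "
                "FROM ost_data ORDER BY ts DESC LIMIT 10")) {
            fprintf(stderr, "query failed: %s\n", mysql_error(conn));
            mysql_close(conn);
            return 1;
        }

        MYSQL_RES *res = mysql_store_result(conn);
        MYSQL_ROW row;
        while ((row = mysql_fetch_row(res)) != NULL)
            printf("%-12s %12s KB free  (%s)\n", row[0], row[1], row[2]);

        mysql_free_result(res);
        mysql_close(conn);
        return 0;
    }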

LMT2 top: ltop
- Multiple views: router, router group, filesystem, OST, OSS, MDS, ...
- Low overhead
- Curses-based

The LMT2 GUI
Start with xwatch-lustre functionality, then add:
- New views (OSS, filesystem, router group, ...)
- Plotting capability (historical trends, heartbeat, ...)
- Customization features
- Full-system health at a glance
- Client display
A new graphical chart control is in development.

LMT2 Plans
- LMT 2.0 release [internal]
- LMT 2.0 release [external]
- Extend the database access class
- Add more views to the GUI and ltop
- Extend the new GUI to support historical and trending plots
- Release version LMT 2.1
- Collect OSS-specific data
- Add views for OSS-specific data in the LMT utilities
- Extend the new xwatch-lustre to include a global health view of Lustre
- Release version LMT 2.2
- Add support for viewing client data
- Release version LMT 2.3

Failover implementation
- Linux-HA based
- Initial implementation currently undergoing test
- Priority on fencing and on requirements for preventing data loss
- Based upon Release 2 of the Linux-HA software (active development, testing, fixing)

Failover
[Diagram: pairs of OSS nodes configured for failover, with each pair attached to shared RAID controllers and disk drawers]

ZFS prototype
- LLNL is launching a prototyping effort to investigate the viability of running OSTs atop Sun's ZFS file system.
- Our prototyping effort only goes as far as porting a portion of ZFS into the Linux kernel.
- Our goal is to learn the viability of the partial port and let the results guide any future work.

Lustre/ZFS motivation

EXT3 problems:
- Max OST filesystem size of 16-32 TB
- Offline fsck recovery time
- Data corruption goes unnoticed
- Crashes, corruption, fsck challenges and complexity

ZFS:
- Max OST filesystem size not limited by the file system
- Consistency checking is online
- Every block is checksummed (metadata and data)

Other ZFS benefits:
- Copy-on-write may result in more streaming I/O
- More redundancy options (RAID-Z2, metadata ditto blocks, ...)
- Administrative flexibility
- JBOD and other hardware options (see the illustrative commands below)
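
For a sense of the administrative flexibility and built-in checksumming the slide refers to, ZFS administration of that era looks roughly like the commands below; the pool name and device names are placeholders.

    # Create a double-parity (RAID-Z2) pool directly from JBOD disks
    zpool create ost0pool raidz2 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0

    # Every block is checksummed; the algorithm is a per-dataset property
    zfs set checksum=fletcher4 ost0pool

    # Online consistency checking: scrub the pool while it stays in service
    zpool scrub ost0pool
    zpool status -v ost0pool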

Lustre/ZFS Integration Strategy
- Replace EXT3 on OSTs with ZFS.
- Port only the ZFS Data Management Unit (DMU) and Storage Pool Allocator (SPA); the ZFS POSIX Layer (ZPL) is not ported.
- Requires fsfilt-to-DMU integration (a hedged sketch follows below).

Layering (from the slide diagram):
- Solaris path: system call -> ZFS POSIX Layer (ZPL) via VOP_MUMBLE()
- Lustre path: OST -> <fsfilt> -> DMU directly
- The ZPL/fsfilt layer addresses the Data Management Unit (DMU) by <dataset, object, offset>
- The DMU addresses the Storage Pool Allocator (SPA) by <data virtual address>
- The SPA addresses the device driver by <physical device, offset>
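
A rough sketch of what an fsfilt-style object write routed through the DMU might look like, using the transaction interfaces from the OpenSolaris ZFS code of that era (dmu_tx_create, dmu_tx_hold_write, dmu_tx_assign, dmu_write, dmu_tx_commit). The wrapper name, the error handling, and the way the objset would be obtained are simplifications for illustration, not the actual Lustre integration code.

    #include <sys/dmu.h>

    /* Write `size` bytes of `buf` into `object` at `offset`, going through the
     * DMU instead of ext3. Every modification happens inside a DMU transaction. */
    int fsfilt_zfs_write(objset_t *os, uint64_t object,
                         uint64_t offset, uint64_t size, const void *buf)
    {
        dmu_tx_t *tx = dmu_tx_create(os);
        int err;

        dmu_tx_hold_write(tx, object, offset, size);

        err = dmu_tx_assign(tx, TXG_WAIT);   /* wait for room in a txg */
        if (err) {
            dmu_tx_abort(tx);
            return err;
        }

        /* The DMU addresses data by <objset (dataset), object, offset>;
         * block allocation and checksumming happen below it, in the SPA. */
        dmu_write(os, object, offset, size, buf, tx);

        dmu_tx_commit(tx);
        return 0;
    }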

Tri-Lab PathForward Efforts
Tri-Lab (LANL, SNL, LLNL) / HP / CFS efforts:
- InfiniBand:
  - Compute nodes speak only IB
  - I/O nodes translate to IP for 10GigE
  - Lustre storage exists on a 10GigE LAN
  (an illustrative LNET routing configuration follows this slide)
- Clustered MDS
- Security

[Diagram: compute node -> IB -> IB switch -> IB -> I/O node -> 10 GigE]
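
The IB-to-Ethernet translation on the I/O nodes is what LNET routing provides. A hedged example of the kind of module configuration involved is shown below; the network names, interfaces, addresses, and the particular IB LND are placeholders, since the slide does not specify them.

    # Compute nodes (IB only): reach the tcp network through the I/O-node routers
    options lnet networks="o2ib0(ib0)" routes="tcp0 192.168.10.[1-8]@o2ib0"

    # I/O (router) nodes: attached to both fabrics, with forwarding enabled
    options lnet networks="o2ib0(ib0),tcp0(eth2)" forwarding="enabled"

    # Lustre servers on the 10GigE LAN
    options lnet networks="tcp0(eth2)"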

Conclusion
The LLNL/CFS relationship is active and varied:
- At-scale testing, bug fixing, performance issues
- fsck improvements
- Metadata speed-up
- Adaptive timeouts
- Lustre free space management

LLNL is pursuing a number of development efforts:
- ZFS prototype
- Lustre Monitoring Tool 2 (LMT2)
- Failover implementation

Tri-Labs, HP and CFS are working on other areas.

The LCE is working and benefiting the entire Lustre community.