Roads to Zester. Ken Rawlings, Shawn Slavin, Tom Crowe. High Performance File Systems, Indiana University


Introduction
Data Capacitor II: IU site-wide Lustre file system
  Lustre 2.1.6, ldiskfs MDT/OST, 5 PB, 1.5 billion inodes
Reporting needs across the entire file system
Current system: Lester + Lustre stat()
  Struggles since breaking 1 billion inodes
Upcoming file system at IU
  Lustre 2.8, ZFS MDT/OST
  Larger than Data Capacitor II

Lester
Lester, the Lustre lister
Written by David Dillow at ORNL
https://github.com/ornl-techint/lester
Generates a Lustre file list with metadata information directly from ldiskfs
Easily parseable text file: path, (a,c,m)-time, mode, UID, etc.

    1481481220 1481481220 1481481220 1486829 601 100666 358 4 175901125 /ROOT/projects/foo/bar.txt

No equivalent for ZFS
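
As a flavor of how such a record might be consumed, here is a minimal Python sketch; it only interprets the columns the slide names (the three leading timestamps and the trailing path), and even that ordering is an assumption for illustration, not Lester's documented format.

    # Hedged sketch: split one Lester record. Only the three leading
    # timestamp columns and the trailing path are interpreted; the
    # remaining columns are left uninterpreted because their exact order
    # is not spelled out above.
    def parse_lester_line(line):
        cols = line.split()
        atime, ctime, mtime = (int(c) for c in cols[:3])  # assumed (a,c,m) order
        return {"atime": atime, "ctime": ctime, "mtime": mtime,
                "path": cols[-1], "other": cols[3:-1]}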

Goals
Equivalent to the current solution, for ZFS
  Without Lustre stat() if possible
Focus
  Regular files
  Path, UID, GID, mode, timestamps, & size
Two stages
  Gather Lustre metadata from the underlying ZFS layer
  Assemble MDT/OST information and compute file sizes
Remain mindful of scaling needs

Priorities
Doesn't need to be perfect
  Understood error bounds
Faster than Lustre stat()
  Over a billion files measured in days, not weeks
Not parasitic on the filesystem
  No custom code run as root on OSS/MDS

ZFS & Lustre
  Lustre
  ZFS POSIX Layer
  ZFS Attribute Processor
  Data Management Unit

ZDB
Standard ZFS utility
Dumps information about ZFS pools and datasets
'zdb -dddd <dataset>' outputs dataset object information
  Path, timestamps, size, GID, UID, mode, etc.
  Includes extended attributes
    MDT: trusted.lov
    OST: trusted.fid
Somewhat challenging to parse
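
As a rough illustration of driving ZDB from a tool, here is a minimal sketch; the dataset name is hypothetical, and the "Object ... lvl ... iblk" header used to split the dump is inferred from the excerpt on the next slide.

    # Hedged sketch: run 'zdb -dddd' on a dataset and split the dump into
    # per-object text blocks for later parsing.
    import re
    import subprocess

    def zdb_object_blocks(dataset):
        out = subprocess.check_output(["zdb", "-dddd", dataset],
                                      universal_newlines=True)
        block = []
        for line in out.splitlines():
            # Each object section starts with a header line like:
            #   "Object  lvl   iblk   dblk  dsize  lsize   %full  type"
            if re.match(r"\s*Object\s+lvl\s+iblk", line) and block:
                yield "\n".join(block)
                block = []
            block.append(line)
        if block:
            yield "\n".join(block)

    # Hypothetical usage:
    # for obj_text in zdb_object_blocks("lustre-mdt0/mdt0"):
    #     ...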

ZDB MDT Dataset
Provides Lustre metadata information
Need object information for OST lookup to compute file sizes
Requires decoding of trusted.lov

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
       199    1    16K   128K     1K   128K    0.00  ZFS plain file
    path    /ROOT/testfile
    uid     121
    gid     12
    atime   Fri Oct  9 19:56:47 2015
    <...>
    trusted.lov = \320\013\321\013\001\000\000\...

Decoding trusted.lov
Tom Crowe's trusted.lov extended attribute decoding script
  Provides (ostidx, objid) object pairs
  Needed EA translation from the ZDB format
zfsobj2fid utility
  Written by Christopher Morrone at LLNL
  trusted.fid decoding from a ZDB dump
  Available in Lustre ZFS from 2.8 forward
  Includes general ZDB EA translation logic
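
As an illustration of the kind of translation involved (this is not Tom Crowe's script or zfsobj2fid), here is a minimal sketch. It assumes trusted.lov follows the lov_mds_md_v1 wire layout: a 32-byte little-endian header whose first field is the magic 0x0BD10BD0 and whose stripe count sits at offset 28, followed by 24-byte per-stripe entries carrying the object id at offset 0 and the OST index at offset 20. Verify those offsets against the Lustre headers before relying on them.

    # Hedged sketch: decode a ZDB-printed trusted.lov value into
    # (ostidx, objid) pairs, under the lov_mds_md_v1 layout assumptions
    # stated above.
    import re
    import struct

    LOV_MAGIC_V1 = 0x0BD10BD0

    def zdb_escaped_to_bytes(text):
        # ZDB prints EA values with octal escapes, e.g. "\320\013...";
        # printable characters may appear unescaped.
        out = bytearray()
        for octal, literal in re.findall(r"\\([0-7]{1,3})|(.)", text, re.DOTALL):
            out.append(int(octal, 8) if octal else ord(literal))
        return bytes(out)

    def decode_trusted_lov(ea_text):
        raw = zdb_escaped_to_bytes(ea_text)
        magic, = struct.unpack_from("<I", raw, 0)
        if magic != LOV_MAGIC_V1:
            raise ValueError("unexpected lov magic: 0x%x" % magic)
        stripe_count, = struct.unpack_from("<H", raw, 28)   # assumed offset
        pairs = []
        for i in range(stripe_count):
            base = 32 + 24 * i                              # assumed entry size
            objid, = struct.unpack_from("<Q", raw, base)    # OST object id
            ostidx, = struct.unpack_from("<I", raw, base + 20)
            pairs.append((ostidx, objid))
        return pairs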

ZDB OST Dataset
The parent object's FatZAP has a key-value pair for objid lookup:

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
       129    2    16K    16K  16.5K    32K  100.00  ZFS directory
    <...>
    Fat ZAP stats:
    265 = 421 (type: Regular File)

The target ZFS object has the trusted.fid EA and the size:

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
       421    3    16K   128K   269M   269M  100.00  ZFS plain file
    <...>
    size    220182
    trusted.fid = \000\004\000\000\002\000\000\000
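
A minimal parsing sketch for that mapping, assuming the FatZAP entries appear as "key = value (type: Regular File)" lines exactly as in the excerpt above:

    # Hedged sketch: build an objid -> ZFS object number map from the
    # FatZAP entries of an OST's parent directory object.
    import re

    FATZAP_ENTRY = re.compile(r"^\s*(\d+)\s*=\s*(\d+)\s*\(type:\s*Regular File\)")

    def parse_fatzap(dump_text):
        mapping = {}
        for line in dump_text.splitlines():
            m = FATZAP_ENTRY.match(line)
            if m:
                mapping[int(m.group(1))] = int(m.group(2))
        return mapping

    # parse_fatzap(ost_dump_text)[265] -> 421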

File Size Example
MDT 0
  zfsobj [path: /ROOT/foo.txt, ..., trusted.lov: \320...]
    -> [[ostidx: 0, objid: 265], [ostidx: 1, objid: 640]]
OST 0
  fatzap [..., from_id: 265, to_id: 421]
  zfsobj [id: 421, ..., size: 220182]
OST 1
  fatzap [..., from_id: 640, to_id: 475]
  zfsobj [id: 475, ..., size: 232621]
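
In this example the file's size, from the Lustre viewpoint, is the sum of its two OST object sizes: 220182 + 232621 = 452803 bytes.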

Implementation
Named Zester as an homage to Lester
Written in Python 2.7
  Evaluating a future Python 3.x move
Started with an in-memory representation
  Worked well for experimentation and initial scaling
  Problematic past 1 million files
Moved to SQLite as the primary data representation
  Python DB-API 2.0
  Portable SQL

Zester Overview
Parse MDT & OST ZDB dumps -> Generate ZDB SQL DBs -> Compute file sizes -> Generate metadata SQL DB

ZDB Parsing
Each MDT & OST ZDB dataset dump is parsed into a separate SQLite DB file
Some parsing challenges; robust so far
Generated DB schema
  zfsobj [id, path, uid, gid, size, ..., trusted.fid, trusted.lov]
  fatzap [id, from_id, to_id, file_type]
Output
  mdt_<idx>.zdb -> mdt_<idx>.db
  ost_<idx>.zdb -> ost_<idx>.db
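
A minimal sketch of what creating those per-target databases could look like with Python's sqlite3 module; the columns beyond the ones listed above, and the SQL types, are assumptions for illustration, not Zester's actual schema.

    # Hedged sketch: one SQLite DB per MDT/OST dump, shaped like the
    # schema above (column set abridged/assumed).
    import sqlite3

    ZDB_SCHEMA = """
    CREATE TABLE IF NOT EXISTS zfsobj (
        id          INTEGER PRIMARY KEY,  -- ZFS object number
        path        TEXT,
        uid         INTEGER,
        gid         INTEGER,
        mode        INTEGER,
        size        INTEGER,
        atime       INTEGER,
        mtime       INTEGER,
        ctime       INTEGER,
        trusted_fid TEXT,
        trusted_lov TEXT
    );
    CREATE TABLE IF NOT EXISTS fatzap (
        id        INTEGER,  -- parent (directory) object number
        from_id   INTEGER,  -- Lustre object id (ZAP key)
        to_id     INTEGER,  -- ZFS object number (ZAP value)
        file_type TEXT
    );
    """

    def open_zdb_db(path):
        conn = sqlite3.connect(path)      # e.g. "mdt_0.db" or "ost_1.db"
        conn.executescript(ZDB_SCHEMA)
        return conn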

Metadata DB Generation
Loop over all MDT ZFS objects with trusted.lov
Decode trusted.lov into a set of (ostidx, objid) object pairs
For each pair
  Translate objid to the target OST ZFS object using the FatZAP
  Query the target OST ZFS object for its size
Sum the object sizes
Metadata DB
  Represents the file from the Lustre viewpoint
  metadata [path, size, mode, gid, atime, ctime, ...]
Output
  mdt_<idx>.db ost_<idx>.db ... -> metadata.db
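
Putting the pieces together, a hedged sketch of that loop, reusing the hypothetical decode_trusted_lov() and the assumed table layout from the earlier sketches; Zester's actual queries and names may differ.

    # Hedged sketch of the size computation and metadata DB generation.
    # decode_trusted_lov() is the earlier sketch; ost_conns maps
    # ostidx -> sqlite3 connection for that OST's DB.
    import sqlite3

    def file_size(mdt_row, ost_conns):
        total = 0
        for ostidx, objid in decode_trusted_lov(mdt_row["trusted_lov"]):
            ost = ost_conns[ostidx]
            # FatZAP: Lustre object id -> ZFS object number on that OST
            (zfs_id,) = ost.execute("SELECT to_id FROM fatzap WHERE from_id = ?",
                                    (objid,)).fetchone()
            (size,) = ost.execute("SELECT size FROM zfsobj WHERE id = ?",
                                  (zfs_id,)).fetchone()
            total += size
        return total

    def build_metadata_db(mdt_conn, ost_conns, out_path):
        out = sqlite3.connect(out_path)   # e.g. "metadata.db"
        out.execute("CREATE TABLE IF NOT EXISTS metadata (path TEXT, size INTEGER,"
                    " mode INTEGER, uid INTEGER, gid INTEGER,"
                    " atime INTEGER, mtime INTEGER, ctime INTEGER)")
        mdt_conn.row_factory = sqlite3.Row
        rows = mdt_conn.execute("SELECT * FROM zfsobj WHERE trusted_lov IS NOT NULL")
        for row in rows:
            out.execute("INSERT INTO metadata VALUES (?,?,?,?,?,?,?,?)",
                        (row["path"], file_size(row, ost_conns), row["mode"],
                         row["uid"], row["gid"],
                         row["atime"], row["mtime"], row["ctime"]))
        out.commit()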

Testing
Create test files on a Lustre filesystem
  Various modes, sizes, stripes, path depths, etc.
Lustre stat() all test files
  Generate a canonical metadata DB to test against
Compare the Zester metadata DB and the canonical metadata DB
  Currently path, mode, UID, GID, size, atime, mtime, and ctime
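
A hedged sketch of such a comparison harness; table and column names follow the earlier sketches, not necessarily Zester's own, and the 2-second timestamp allowance matches the variance noted on the next slide.

    # Hedged sketch: build a canonical metadata DB via lstat() over the
    # test files, then compare it field-by-field against Zester's DB.
    import os
    import sqlite3

    FIELDS = ("path", "mode", "uid", "gid", "size", "atime", "mtime", "ctime")

    def canonical_db(test_paths, out_path):
        conn = sqlite3.connect(out_path)
        conn.execute("CREATE TABLE metadata (%s)" % ", ".join(FIELDS))
        for p in test_paths:
            st = os.lstat(p)
            conn.execute("INSERT INTO metadata VALUES (?,?,?,?,?,?,?,?)",
                         (p, st.st_mode, st.st_uid, st.st_gid, st.st_size,
                          int(st.st_atime), int(st.st_mtime), int(st.st_ctime)))
        conn.commit()
        return conn

    def compare(zester_conn, canon_conn, time_slack=2):
        mismatches = []
        for row in canon_conn.execute("SELECT * FROM metadata"):
            canon = dict(zip(FIELDS, row))
            got = zester_conn.execute("SELECT * FROM metadata WHERE path = ?",
                                      (canon["path"],)).fetchone()
            if got is None:
                mismatches.append((canon["path"], "missing"))
                continue
            got = dict(zip(FIELDS, got))
            for f in FIELDS:
                if f in ("atime", "mtime", "ctime"):
                    ok = abs(got[f] - canon[f]) <= time_slack
                else:
                    ok = got[f] == canon[f]
                if not ok:
                    mismatches.append((canon["path"], f))
        return mismatches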

Current Status
Work in progress; promising results
  No metadata errors through millions of files (including size)
  Allowing a variance of 2 seconds on timestamps (available timestamp precision is low)
Lustre 2.8 focus
Scale-up currently limited by testing infrastructure
  Expect a limit of testing tens of millions of files in reasonable time
  Will test and scale further against the new file system once it is available

Scalability
Mindful of the billion-scale file need; no known showstoppers
Currently processing thousands of objects per second
  Consistent with a billion objects measured in days, not weeks
Parsing parallelizable across OSTs/MDTs (sketched below)
Profiling: ZDB parsing CPU-limited by strptime()
Low process memory usage
Move to a DB server straightforward when/if necessary
Python remains promising; C extensions where needed
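
Since the per-target dumps are independent, the parsing stage parallelizes naturally. A minimal sketch of that fan-out; parse_zdb_dump() is a hypothetical stand-in for the dump-to-SQLite step, not a Zester function.

    # Hedged sketch: parse each MDT/OST ZDB dump in its own worker process.
    from multiprocessing import Pool

    def parse_one(dump_path):
        db_path = dump_path.replace(".zdb", ".db")
        parse_zdb_dump(dump_path, db_path)  # hypothetical dump -> SQLite step
        return db_path

    def parse_all(dump_paths, workers=8):
        pool = Pool(workers)
        try:
            # Targets are independent, so this scales with the number of
            # MDTs/OSTs until CPU or I/O saturates.
            return pool.map(parse_one, dump_paths)
        finally:
            pool.close()
            pool.join()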

Zester ZDB DBs
Queryable DBs of the ZFS layer underlying the MDTs & OSTs
  The ZFS "under the floorboards" of Lustre
Already proven valuable
  Investigating alternate verification approaches
Useful for more than just reporting
  Filesystem forensics, etc.
Stored information focused on project needs
  Will add more moving forward

Future Work
Continue scale-up and testing
  Test with additional Lustre versions
  Add other file types
  Formalized unit & integration testing
Source code investigation
  ZDB, Lustre ZFS OSD
Adapt as Lustre changes
  Layout Enhancement

Thank You!
Your time and attention are appreciated
Feedback and suggestions: hpfs-admin@iu.edu
  Lustre community
High Performance File Systems @ IU
  Tom Crowe, Chris Hanna, Nathan Heald, Nathan Lavender, Ken Rawlings, Steve Simms, Shawn Slavin
Source code: https://github.com/iu-hpfs/zester
  GPL2 licensed; collaborators welcome!
Questions?