Lustre HSM. Nathan Rutman, SC09, Portland, OR.

Goals
Scalable HSM system
> No scanning
> No duplication of event data
> Parallel data transfer
Interact easily with many HSMs
Focus (tracked in bz15599):
> Version 1 primary goal: deliver ASAP
> Version 2 primary goal: best feature set

Features
V1 main features:
> Whole-file migration
> Copy data back when the layout is requested
> An external PolicyEngine decides what to archive, what to release from Lustre, and when
> File aggregation is possible if the PolicyEngine and Copytool support it
V2 plans:
> Partial data release: keep arbitrary extents on OSTs
> Restore data only when needed (read, sub-stripe write)

Components
> Policy Engine (bz20693)
> Coordinator (bz20234)
> MDT (LMA bz19669, layout lock bz13183)
> Copytool (bz20334)
> HSM backend

Archive Modified Files (flow)
Client closes a modified file -> MDS writes a ChangeLog entry -> Policy Engine consumes the entry -> Coordinator issues an archive request -> Copytool copies the file to the HSM -> tape-attached node stores the data -> Coordinator sets the archived bit

Archive Modified Files (diagram)

PolicyEngine: Robinhood (bz20693)
Robinhood is a filesystem monitoring, archiving, and release tool.
Project
> GPL-compatible, developed by CEA
> Website: http://robinhood.sf.net/
Main features
> Gathers filesystem information from scans (POSIX scan) or from events (Lustre 2.0 ChangeLogs; see the sketch below)
> Triggers file releases depending on policy rules
> Triggers file archiving depending on policy rules (for Lustre HSM)
> Policies are based on file attributes and thresholds
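The event-based gathering mentioned above relies on Lustre ChangeLogs. As a rough sketch only, a consumer written against the llapi_changelog_* calls as they appear in released Lustre 2.x liblustreapi (an assumption relative to the Lustre 2.0 interface the slide refers to) would look roughly like this:

/* ChangeLog consumer sketch. Uses the llapi_changelog_* calls from
 * released Lustre 2.x liblustreapi; treat the exact names/signatures
 * as an assumption relative to the Lustre 2.0 interface on this slide. */
#include <stdio.h>
#include <lustre/lustreapi.h>

int follow_changelog(const char *mdtname, long long startrec)
{
        struct changelog_rec *rec;
        void *ctx;
        int rc;

        /* Open the ChangeLog stream of one MDT, blocking for new records. */
        rc = llapi_changelog_start(&ctx, CHANGELOG_FLAG_BLOCK, mdtname, startrec);
        if (rc < 0)
                return rc;

        /* llapi_changelog_recv() returns 0 while records are available. */
        while ((rc = llapi_changelog_recv(ctx, &rec)) == 0) {
                /* A policy engine would update its database here, e.g.
                 * mark a file dirty on close so it gets re-archived. */
                printf("record %llu type %d "DFID"\n",
                       (unsigned long long)rec->cr_index, rec->cr_type,
                       PFID(&rec->cr_tfid));
                llapi_changelog_free(&rec);
        }

        llapi_changelog_fini(&ctx);
        return rc;
}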

Robinhood: Policy Examples
> Based on file attributes such as path, name, time, size, ...
> Rules can be combined with boolean operators
> Files can be white-listed
Simple example:

    Purge_policy {
        Trigger {
            trigger_on         = OST_usage
            high_watermark_pct = 85%
            low_watermark_pct  = 80%
        }
    }

    Whitelist {
        last_access < 1h or owner = root
    }

Copytool (bz20334)
Userspace interface between Lustre and the HSM
Knows how to communicate with a specific HSM:
> Maps FIDs to HSM identifiers
> Performs the bulk data transfer
An llapi registration call informs the coordinator that the copytool is available
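For reference, the registration call and receive loop eventually took roughly this shape in the llapi_hsm_copytool_* interface of Lustre 2.5+; the names and signatures below are from that later interface and are an assumption relative to the prototype discussed in this talk.

/* Minimal copytool skeleton: register with the coordinator and wait for
 * actions. Written against the llapi_hsm_copytool_* interface of later
 * Lustre 2.x releases (an assumption relative to this 2009 design). */
#include <stdio.h>
#include <string.h>
#include <lustre/lustreapi.h>

int main(int argc, char **argv)
{
        struct hsm_copytool_private *ct;
        int archive_id = 1;     /* example archive backend number */
        int rc;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <lustre_mount_point>\n", argv[0]);
                return 1;
        }

        /* Registration: from now on the coordinator may dispatch
         * archive/restore/remove actions to this process. */
        rc = llapi_hsm_copytool_register(&ct, argv[1], 1, &archive_id, 0);
        if (rc < 0) {
                fprintf(stderr, "cannot register copytool: %s\n",
                        strerror(-rc));
                return 1;
        }

        for (;;) {
                struct hsm_action_list *hal;
                struct hsm_action_item *hai;
                unsigned int i;
                int msgsize;

                /* Block until the coordinator sends a batch of actions. */
                rc = llapi_hsm_copytool_recv(ct, &hal, &msgsize);
                if (rc < 0)
                        break;

                hai = hai_first(hal);
                for (i = 0; i < hal->hal_count; i++, hai = hai_next(hai)) {
                        /* hai_action is HSMA_ARCHIVE, HSMA_RESTORE, ...;
                         * hai_fid and hai_extent identify the work. */
                        printf("action %d on "DFID"\n",
                               (int)hai->hai_action, PFID(&hai->hai_fid));
                        /* A real copytool would begin the action, move the
                         * data to/from the HSM, and report completion. */
                }
        }

        llapi_hsm_copytool_unregister(&ct);
        return rc < 0 ? 1 : 0;
}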

Architecture: Archive
(Diagram: MDS, OSSes, and clients on the Lustre side; the copytool bridges to the HSM side over HSM protocols)
> The policy engine decides what to archive and when
> The coordinator assigns archive actions to copytools
> Copytools report progress
> The coordinator sets the HSM bits in the LMA

Archive Protocol
1. The policy engine (or an administrator) decides to copy a file to the HSM and executes an HSM_Archive ioctl on the file (see the sketch after this list)
2. The ioctl is caught by the MDT, which passes the request to the coordinator
3. The coordinator dispatches the request to a copytool; the request includes file extents
4. A normal extents read lock is taken by the copytool running on a client
5. The copytool sends an "archive begin" status message to the coordinator via an ioctl on the file
6. The coordinator/MDT sets the "hsm_exists" bit and clears the "hsm_dirty" bit. "hsm_exists" is never cleared; it indicates that a copy (possibly partial or out of date) exists in the HSM
7. Any write to the file causes the MDT to set the "hsm_dirty" bit (this may be lazy/delayed; SOM guarantees eventual correctness). File writes need not cancel archiving, though cancellation may be desirable in some situations (V2)
8. The copytool sends periodic status updates to the coordinator via ioctl calls on the file (e.g. % complete)
9. The copytool sends an "archiving done" message to the coordinator via ioctl
10. The coordinator/MDT checks the hsm_dirty bit. If the file is not dirty, the MDT sets the "hsm_archived" bit. If it is dirty, the coordinator dispatches another archive request; goto step 3
11. The MDT adds an HSM_archived <fid> record to the changelog
12. The policy engine notes the HSM_archived record in the changelog
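As a sketch of step 1 only: in the liblustreapi interface that shipped later, the HSM archive ioctl is wrapped by llapi_hsm_request(); the call and structure names below come from that released interface, not necessarily from this design.

/* Ask the coordinator to archive one file (step 1 above), via the
 * llapi_hsm_request() wrapper from later liblustreapi releases; the
 * wrapper issues the HSM request ioctl described in this talk. */
#include <errno.h>
#include <stdlib.h>
#include <lustre/lustreapi.h>

int archive_one_file(const char *path, int archive_id)
{
        struct hsm_user_request *hur;
        int rc;

        hur = llapi_hsm_user_request_alloc(1, 0);  /* 1 item, no extra data */
        if (hur == NULL)
                return -ENOMEM;

        hur->hur_request.hr_action     = HUA_ARCHIVE;
        hur->hur_request.hr_archive_id = archive_id;
        hur->hur_request.hr_flags      = 0;
        hur->hur_request.hr_itemcount  = 1;
        hur->hur_request.hr_data_len   = 0;

        /* V1 archives whole files, so request the full extent. */
        hur->hur_user_item[0].hui_extent.offset = 0;
        hur->hur_user_item[0].hui_extent.length = -1ULL;

        rc = llapi_path2fid(path, &hur->hur_user_item[0].hui_fid);
        if (rc == 0)
                rc = llapi_hsm_request(path, hur);

        free(hur);
        return rc;
}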

Architecture: Release
(Diagram: the policy engine monitors the MDS and OSSes and triggers migrations and purges)
Policy Engine
> Monitors filesystem disk space usage
> Pre-migrates aging data
> When Lustre needs more space, releases data from files already archived in the HSM
MDT
> Removes the file layout under the layout lock

Release
Release is possible if:
> The file is closed
> The file is not dirty (HSM_dirty clear)
> The file is archived in the HSM (HSM_archived set)
Protocol:
> MDT takes a PW layout lock
> MDT takes a full-file extents lock, then drops the extents lock
> MDT verifies the above conditions
> MDT erases the layout, removes the OST objects, and sets the hsm_released bit
Metadata is retained by the MDT; the file looks like an unstriped file
The file size must also be saved on the MDT
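A userspace view of the "release is possible if" checks, using llapi_hsm_state_get() and the HS_* state flags from the later released interface (assumed names for the hsm_dirty/hsm_archived/hsm_released bits on this slide); the "file is closed" condition is enforced by the MDT and is not visible from here.

/* Check the per-file HSM state bits before asking for a release.
 * HS_DIRTY/HS_ARCHIVED/HS_RELEASED are the released-interface names
 * assumed for the hsm_* bits described in this talk; the "file is
 * closed" condition is checked by the MDT itself, not here. */
#include <stdbool.h>
#include <lustre/lustreapi.h>

bool file_is_releasable(const char *path)
{
        struct hsm_user_state hus;

        if (llapi_hsm_state_get(path, &hus) < 0)
                return false;

        if (hus.hus_states & HS_DIRTY)          /* modified since archive */
                return false;
        if (hus.hus_states & HS_RELEASED)       /* objects already gone */
                return false;

        return (hus.hus_states & HS_ARCHIVED) != 0;
}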

Layout Lock (bz13183)
A new MDT layout lock is created for every file
> Add an MDS_INODELOCK_LAYOUT inodebit and an IT_LAYOUT intent
> If the MDT calls back the inodelock bit, the client marks the layout invalid in llu_inode_info
The client gets the layout when needed (via intent if it can, or explicitly):
> Get the layout intent lock (cached)
> If the layout is invalid, get the layout
The client takes a reference on the layout wherever layout info is used:
> Extents (llu_file_rwx)
> Truncate (llu_setattr_raw, ll_setattr_truncate)
> OST mtime (llu_setattr_raw, ll_setattr_ost)
> Mmap (ll_file_mmap)
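The client-side pattern above (cached layout lock, refetch on invalidation, pin while in use) can be condensed into the following self-contained sketch; every type and helper here is a hypothetical stand-in, not a real Lustre client symbol.

/* Hypothetical, self-contained sketch of the client layout-lock flow
 * described above; none of these names are real Lustre client symbols. */
#include <stdbool.h>

struct demo_inode {
        bool layout_valid;      /* cleared by the MDT lock callback */
        int  layout_refcount;   /* pins the layout while I/O uses it */
};

/* Stand-in for enqueueing the cached MDS_INODELOCK_LAYOUT/IT_LAYOUT lock. */
static int take_layout_intent_lock(struct demo_inode *inode)
{
        (void)inode;
        return 0;
}

/* Stand-in for fetching the current layout from the MDT. */
static int fetch_layout_from_mdt(struct demo_inode *inode)
{
        inode->layout_valid = true;
        return 0;
}

/* Called on the I/O paths listed above before the layout is dereferenced. */
static int layout_use_begin(struct demo_inode *inode)
{
        int rc = take_layout_intent_lock(inode);

        if (rc != 0)
                return rc;
        if (!inode->layout_valid) {             /* invalidated by a callback */
                rc = fetch_layout_from_mdt(inode);
                if (rc != 0)
                        return rc;
        }
        inode->layout_refcount++;               /* hold a ref while in use */
        return 0;
}

static void layout_use_end(struct demo_inode *inode)
{
        inode->layout_refcount--;
}

/* MDT callback on the layout inodebit: drop the cached layout. */
static void layout_lock_cancel(struct demo_inode *inode)
{
        inode->layout_valid = false;            /* next user refetches it */
}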

Layout Generation (bz13183)
> A layout change increases the layout generation
> The generation is passed with each extent lock enqueue
> The MDT takes full-file extent locks on the OSTs for a layout change, propagating the generation to the OSTs
> Each OST stores the greatest generation it has seen
> If an OST sees a request for an old layout generation (or a deleted object), the client has been partitioned from the MDT (it did not get the layout update) and is denied extent locks with ESTALE (it is not evicted from the OST)
Code changes:
> Split the lov_mds_md u32 stripe_count into a u16 stripe_count and a u16 layout_gen
> Add layout_gen to struct obdo and md_attr
> LOV increases the generation in qos_prep_create
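The lov_mds_md change listed above amounts to splitting one 32-bit field in two, which keeps the wire and on-disk size of the structure unchanged. The fragment below shows only the affected fields; surrounding field names follow the released headers and are an assumption beyond what the slide states.

#include <linux/types.h>

/* Before bz13183: the stripe count uses a full 32 bits. */
struct lmm_tail_old {
        __u32 lmm_stripe_size;          /* stripe size in bytes */
        __u32 lmm_stripe_count;         /* number of stripes */
        /* struct lov_ost_data lmm_objects[] follows */
};

/* After: the same 32 bits carry the stripe count and the layout
 * generation, so struct lov_mds_md does not grow. */
struct lmm_tail_new {
        __u32 lmm_stripe_size;          /* stripe size in bytes */
        __u16 lmm_stripe_count;         /* number of stripes */
        __u16 lmm_layout_gen;           /* bumped on every layout change */
        /* struct lov_ost_data lmm_objects[] follows */
};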

Architecture: Restore
(Diagram: clients, MDS with coordinator, OSSes, copytool, external HSM)
> Triggered internally/automatically
> No policy engine interaction
> Transparent to the client
> The coordinator requests the restore from a copytool
> The restore locking protocol is complicated

Restore RPCs

Restore Locking Protocol (bz20631)
1. Triggered by a client layout intent lock request (O_NONBLOCK should not cause a restore)
2. The MDT checks the hsm_released bit; if the file is released, the MDT takes a PW lock on the layout
3. The MDT creates a new layout with a stripe pattern similar to the original, increasing the layout version and allocating new objects on new OSTs with the new version
4. The MDT (LOV) enqueues a group write lock on extents 0-EOF (there is no timeout for the group lock)
5. The MDT releases the PW layout lock
6. The client gets a CR layout lock, but must now wait for the read/write file extent locks held by the MDT
7. The MDT sends a request to the coordinator asking for a restore of the file to .lustre/fid/<fid>, with the group lock id and extents 0-EOF
8. The coordinator distributes that request to an appropriate copytool
9. The copytool opens .lustre/fid/<fid> for write (getting the current layout)
10. The copytool uses the group lock and copies data from the HSM, reporting progress via ioctl
11. When finished, the copytool reports success and closes the file, releasing its group extents lock
12. The MDT clears the hsm_released bit
13. The MDT releases the group extents lock. This sends a completion AST to the original client, which now receives its extents lock
14. The MDT adds an HSM_copyin_complete <fid> record to the changelog (for the policy engine)
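On the copytool side, the restore half of this protocol (steps 8-11) was eventually expressed through llapi_hsm_action_begin/progress/end in Lustre 2.5+ liblustreapi. The sketch below uses that later interface as an assumption, and hsm_backend_read() is a hypothetical helper standing in for the actual HSM read; error handling is deliberately minimal.

/* Copytool side of a restore (steps 8-11 above), sketched against the
 * llapi_hsm_action_* interface of later liblustreapi releases (an
 * assumption relative to the ioctl protocol in this talk). */
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>
#include <lustre/lustreapi.h>

/* Hypothetical helper: read 'len' bytes at offset 'off' from the HSM
 * copy of 'fid' and write them to 'lustre_fd'; returns 0 on success
 * or a positive errno. */
extern int hsm_backend_read(const struct lu_fid *fid, int lustre_fd,
                            off_t off, size_t len);

int copytool_do_restore(struct hsm_copytool_private *ct,
                        const struct hsm_action_item *hai)
{
        struct hsm_copyaction_private *hcp;
        struct hsm_extent he;
        int fd, rc;

        /* Tell the coordinator we are starting; for a restore this is
         * where the new layout and group extent lock come into play. */
        rc = llapi_hsm_action_begin(&hcp, ct, hai, -1, 0, false);
        if (rc < 0)
                return rc;

        he.offset = hai->hai_extent.offset;
        he.length = 0;

        /* Open the file's new objects for writing (step 9:
         * ".lustre/fid/<fid>" opened under the copytool's group lock). */
        fd = llapi_hsm_action_get_fd(hcp);
        if (fd < 0)
                return llapi_hsm_action_end(&hcp, &he, 0, -fd);

        /* Copy the requested extent back from the HSM and report it. */
        rc = hsm_backend_read(&hai->hai_fid, fd,
                              hai->hai_extent.offset,
                              hai->hai_extent.length);
        if (rc == 0) {
                he.length = hai->hai_extent.length;
                llapi_hsm_action_progress(hcp, &he, he.length, 0);
        }
        close(fd);

        /* Completion lets the MDT clear hsm_released and drop the group
         * extent lock, waking the original client (steps 11-13). */
        return llapi_hsm_action_end(&hcp, &he, 0, rc);
}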