European Lustre Workshop, Paris, France, September 2011. Hands on Lustre 2.x. Johann Lombardi, Principal Engineer, Whamcloud, Inc.


Main Changes in Lustre 2.x
- MDS rewrite
- Client I/O rewrite
- New ptlrpc API called req_capsule
- Changelogs
- New File IDentifier (FID) and request format
  - incompatibility with 1.6/1.8 protocol
- More to come
  - OSD restructuring

File Identifiers (FIDs)
- All network filesystems rely on a file identifier
  - used to uniquely identify a file/object in a network request
- NFS uses a file handle (up to 64 bytes in NFSv3)

FIDs in Lustre 1.8
- On the MDS, files are identified by:
  - 32-bit inode number allocated by the underlying ldiskfs filesystem
  - 32-bit generation number also maintained by ldiskfs
- On the OSTs, objects are identified by:
  - 64-bit object id allocated sequentially starting from 1
  - 32-bit index which is the OST index in the LOV

    [client]# lfs getstripe foo
    foo
    lmm_stripe_count:   2
    lmm_stripe_size:    1048576
    lmm_stripe_offset:  0
        obdidx    objid    objid    group
             0        3      0x3        0
             1        3      0x3        0

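Replay Issue
(diagram: Lustre client, MDT, restarted MDT)
1. The client sends a create request to the MDT.
2. The MDT:
   - creates the file
   - allocates ino/gen & returns it as the FID
   - assigns a transno
   - releases dlm locks (other clients can now see the file)
3. The MDT sends the reply; the client repacks the request with transno + FID (inum/gen).
4. The MDT restarts and the client resends the create with transno + FID (inum/gen).
5. The restarted MDT must recreate the file with the same inum.
Problem: the inum might have been reused already (e.g. by a llog).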
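The replay problem above can be illustrated with a toy model. This is only a sketch: `ToyMDT`, its inode counter, and the starting inode number 100 are hypothetical and do not mirror any Lustre or ldiskfs code. It shows why an identifier tied to a backend inode number is fragile when the server has to recreate the object after a restart.

```python
# Toy model of the 1.8 replay problem: identifiers are inode numbers chosen
# by the backend filesystem, so replay must recreate the very same inum.
# ToyMDT and the starting inum are hypothetical, for illustration only.

class ToyMDT:
    def __init__(self):
        self.inodes = {}       # inum -> name of the object using it
        self.next_inum = 100

    def create(self, name):
        """Normal create: the backend picks the inode number."""
        inum = self.next_inum
        self.next_inum += 1
        self.inodes[inum] = name
        return inum

    def replay_create(self, name, inum):
        """Replay after restart: the client insists on the original inum."""
        if inum in self.inodes:
            raise RuntimeError(
                f"inum {inum} already used by {self.inodes[inum]!r}")
        self.inodes[inum] = name
        return inum


# Before the crash: the create is executed but not yet committed to disk.
mdt = ToyMDT()
fid_inum = mdt.create("foo")

# After the restart: the uncommitted create is gone, and an internal object
# (for example a llog file) takes the same inode number first.
restarted = ToyMDT()
restarted.create("llog")

try:
    restarted.replay_create("foo", fid_inum)
    replay_ok = True
except RuntimeError as err:
    replay_ok = False
    print("replay failed:", err)
```

In this model the replay fails because the identifier the client holds is only meaningful as long as the backend can hand out the same inode number again.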

New FID Scheme in Lustre 2.x
- Independent of MDS backend filesystem
- Simplify recovery
  - e.g. no need to regenerate an inode with a specific inode number during replay
- Get rid of the infamous iopen patch
- Can be generated on the client
  - requirement for metadata writeback cache
- Add support for object versioning

FID layout (128 bits):
    Field         Size       Meaning
    Sequence #    64 bits    sequence number allocated to the client
    FID #         32 bits    object identifier unique in its sequence
    Version #     32 bits    object version
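A minimal sketch of the 64/32/32-bit layout described above, assuming a simple big-endian packing. The `Fid` class and the byte order are illustrative choices, not the Lustre wire or on-disk format.

```python
import struct
from typing import NamedTuple


class Fid(NamedTuple):
    """Hypothetical container for the three FID fields from the slide."""
    seq: int   # 64-bit sequence number
    oid: int   # 32-bit object identifier, unique within the sequence
    ver: int   # 32-bit object version

    def pack(self) -> bytes:
        # ">QII" = big-endian u64 + u32 + u32 (byte order is an assumption)
        return struct.pack(">QII", self.seq, self.oid, self.ver)

    @classmethod
    def unpack(cls, raw: bytes) -> "Fid":
        return cls(*struct.unpack(">QII", raw))


fid = Fid(seq=0x200000400, oid=0x1, ver=0x0)
raw = fid.pack()
print(len(raw) * 8, "bits")        # 64 + 32 + 32
print(Fid.unpack(raw) == fid)
```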

FIDs in Practice

    [client]# touch foo
    [client]# lfs path2fid foo
    [0x200000400:0x1:0x0]
    (fields: sequence : object ID : version)
    [client]# ln foo bar
    [client]# lfs fid2path /mnt/lustre [0x200000400:0x1:0x0]
    /mnt/lustre/foo
    /mnt/lustre/bar
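The bracketed form printed by lfs path2fid can be split into its three fields with a few lines of Python. The helper names `parse_fid` and `format_fid` are hypothetical; the sketch assumes the `[0xSEQ:0xOID:0xVER]` notation shown above.

```python
import re

# Matches the "[0x...:0x...:0x...]" notation shown in the slide.
_FID_RE = re.compile(
    r"^\[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]$")


def parse_fid(text):
    """Return (sequence, object_id, version) as integers."""
    match = _FID_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a FID string: {text!r}")
    return tuple(int(part, 16) for part in match.groups())


def format_fid(seq, oid, ver):
    """Inverse of parse_fid, using lowercase hex."""
    return f"[{seq:#x}:{oid:#x}:{ver:#x}]"


seq, oid, ver = parse_fid("[0x200000400:0x1:0x0]")
print(hex(seq), oid, ver)
print(format_fid(seq, oid, ver))
```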

Sequence
- Sequences are granted to clients by servers
  - when a client connects, a new FID sequence is allocated
  - upon disconnect, a new sequence is allocated to the client when it comes back
- Each sequence has a limited number of objects which may be created in it
  - on exhaustion, a new sequence should be started
- Sequences are cluster-wide and prevent FID collision

(diagram: SEQ controller -> super sequence -> SEQ manager -> meta sequence -> Client)
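The allocation rules above can be sketched as follows. `ToySeqServer` and `ToyClient` are hypothetical names, the two-level controller/manager hierarchy is collapsed into a single server, and the default width is the value that appears in the lctl output on the next slide.

```python
class ToySeqServer:
    """Hands out cluster-wide unique sequence numbers (simplified)."""

    def __init__(self, first_seq=0x200000400):
        self._next = first_seq

    def grant(self):
        seq = self._next
        self._next += 1
        return seq


class ToyClient:
    """Generates FIDs locally out of the sequence it was granted."""

    def __init__(self, server, width=131072):
        self.server = server
        self.width = width            # FIDs available per sequence
        self.seq = server.grant()     # granted at connect time
        self.next_oid = 1

    def alloc_fid(self):
        if self.next_oid > self.width:
            # Sequence exhausted: ask the server for a fresh one.
            self.seq = self.server.grant()
            self.next_oid = 1
        fid = (self.seq, self.next_oid, 0)
        self.next_oid += 1
        return fid


server = ToySeqServer()
a = ToyClient(server, width=2)   # tiny width to show exhaustion quickly
b = ToyClient(server, width=2)
for _ in range(3):
    print("a:", [hex(x) for x in a.alloc_fid()],
          "b:", [hex(x) for x in b.alloc_fid()])
```

Because every sequence is granted to exactly one client, two clients never produce the same (sequence, object id) pair even though they generate FIDs without talking to each other.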

Sequence in Practice

    [client]# lctl get_param seq.cli-srv*.*
    seq.cli-srv-xxxxx.fid=[0x200000400:0x1:0x0]
    seq.cli-srv-xxxxx.server=lustre-mdt0000_uuid
    seq.cli-srv-xxxxx.space=[0x200000401-0x200000401]:0:0
    seq.cli-srv-xxxxxx.width=131072
    [client]# touch foobar
    [client]# lfs getstripe -v ./foobar
    ./foobar
    lmm_magic:          0x0BD10BD0
    lmm_seq:            0x200000400
    lmm_object_id:      0x2
    lmm_stripe_count:   1
    lmm_stripe_size:    1048576
    lmm_stripe_pattern: 1
    lmm_stripe_offset:  1
        obdidx    objid    objid    group
             1        3      0x3        0

Notes on the lctl output:
- fid: last allocated FID
- space: unused sequence allocated to this client
- width: number of FIDs that can be generated out of a sequence
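As a rough illustration of how the fid and width values relate, the sketch below parses two of the parameter lines shown above and estimates how many FIDs are left in the current sequence. The helper `parse_params` is hypothetical, and the estimate assumes object ids are handed out consecutively starting from 1, which the slide does not state explicitly.

```python
sample = """\
seq.cli-srv-xxxxx.fid=[0x200000400:0x1:0x0]
seq.cli-srv-xxxxx.width=131072
"""


def parse_params(text):
    """Map the last component of each 'a.b.c=value' line to its value."""
    params = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.rsplit(".", 1)[1]] = value
    return params


params = parse_params(sample)
seq, oid, ver = (int(x, 16) for x in params["fid"].strip("[]").split(":"))
width = int(params["width"])

# Assumption: oids run from 1 to width within one sequence.
remaining = width - oid
print(f"sequence {seq:#x}: {remaining} FIDs left")
```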

Where are FIDs stored? (1/2)
- The underlying filesystem still operates on inodes
- An object index is stored on disk to handle the FID/on-disk inode mapping
- For ldiskfs, the object index is an IAM lookup table (namely oi.16)

    debugfs: ls
    2 (12) .            2 (12) ..           11 (20) lost+found    12 (16) CONFIGS
    25001 (16) OBJECTS  19 (20) lov_objid   22 (16) oi.16         23 (12) fld
    24 (16) seq_srv     25 (16) seq_ctl     26 (20) capa_keys     25002 (16) PENDING
    25003 (12) ROOT     27 (20) last_rcvd   25004 (20) REM_OBJ_DIR
    31 (3852) CATALOGS

Where are FIDs stored? (2/2)
The FID is also stored:
- in an extended attribute called LMA
  - stands for Lustre Metadata Attributes
  - also stores SOM/HSM states
  - see struct lustre_mdt_attrs for the format
- in the directory entry, along with the filename
  - path->fid translation does not require accessing the LMA XATTR
  - ext4 & e2fsprogs patch to support this feature

Dump/Restore with Lustre 2.x
- Object Index (OI) stores the FID->ino translation
- FID also stored in the LMA XATTR
- Dump/Restore of the MDT requires either:
  - restoring files with their original inode numbers so that the OI is still valid
    - only a block-level copy can guarantee this
  - OR fixing the OI with the new inode numbers
    - not possible currently
    - will be possible in the future with OI scrubbing (already funded by OpenSFS)
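The two options above can be made concrete with a toy model. The dictionaries, inode numbers, and the `build_oi` helper are hypothetical; the sketch only shows the idea of an index that goes stale when inode numbers change and can be rebuilt from the FID kept in each inode's LMA. It is not a description of how OI scrubbing is actually implemented.

```python
def build_oi(inodes):
    """Build a FID -> inode number index by reading each inode's LMA FID."""
    return {lma_fid: ino for ino, lma_fid in inodes.items()}


# Original MDT: inode number -> FID stored in the LMA xattr.
original = {101: (0x200000400, 1, 0), 102: (0x200000400, 2, 0)}
oi = build_oi(original)

# File-level restore: same files and xattrs, but new inode numbers.
restored = {7: (0x200000400, 1, 0), 8: (0x200000400, 2, 0)}

# Every OI entry now points at an inode that no longer holds that FID.
stale = [fid for fid, ino in oi.items() if restored.get(ino) != fid]
print("stale entries:", len(stale))

# Scanning the restored inodes and reading their LMA repairs the index.
fixed_oi = build_oi(restored)
print("fixed lookup:", fixed_oi[(0x200000400, 1, 0)])
```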

Link Extended Attribute
- New XATTR storing the list of parent FIDs and names
- Useful for:
  - verifying the directory hierarchy
  - FID to path translation (lfs fid2path)
  - updating parent directory entries when migrating files
  - POSIX lookup-by-fid path permission checks
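A minimal sketch of how a list of (parent FID, name) pairs allows FID to path translation. The `link_ea` dictionary, the string labels standing in for FIDs, and the `fid2paths` helper are hypothetical and do not reflect the on-disk format of the link attribute.

```python
# fid -> list of (parent_fid, name) entries, as a link xattr would hold.
# String labels stand in for real FIDs to keep the example readable.
link_ea = {
    "F_dir": [("F_root", "dir")],
    "F_foo": [("F_dir", "foo"), ("F_dir", "bar")],   # two hard links
}


def fid2paths(fid, root="F_root"):
    """Return every path (relative to the root) that leads to fid."""
    if fid == root:
        return [""]
    paths = []
    for parent, name in link_ea.get(fid, []):
        for prefix in fid2paths(parent, root):
            paths.append(f"{prefix}/{name}")
    return paths


print(fid2paths("F_foo"))
```

Walking upward through the parent FIDs yields one path per hard link, which matches the two paths printed by lfs fid2path in the earlier example.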

Compatibility Mode: IGIF
- Filesystems upgraded from 1.8 don't have the FID stored in an EA or in the directory entry
- Name/FID mapping is handled by IGIF
- IGIF reversibly maps inode/generation to a FID
- A special sequence range is reserved
  - up to ~4B inodes

IGIF layout:
    Field        Size       Content
    Sequence     64 bits    0x0 (upper 32 bits), inum (lower 32 bits)
    Object ID    32 bits    gen
    Version      32 bits    0x0
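The reversible mapping described above can be sketched as follows. The function names are hypothetical, and the check only tests that the upper 32 bits of the sequence are zero; the exact bounds of the reserved range are not given in the slides.

```python
U32 = 2**32


def igif_from_inode(inum, gen):
    """Map a 1.8-style (inode number, generation) pair to a FID tuple."""
    if not (0 <= inum < U32 and 0 <= gen < U32):
        raise ValueError("inum and gen must fit in 32 bits")
    return (inum, gen, 0)            # (sequence, object id, version)


def is_igif(fid):
    """Assumption: a sequence that fits in 32 bits marks an IGIF."""
    return fid[0] < U32


def inode_from_igif(fid):
    """Inverse mapping: recover (inum, gen) from an IGIF."""
    if not is_igif(fid):
        raise ValueError("not an IGIF")
    seq, oid, _ver = fid
    return seq, oid


old_style = igif_from_inode(25001, 7)
print(old_style, inode_from_igif(old_style))
print(is_igif((0x200000400, 1, 0)))   # a new-scheme FID from the slides
```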

Interoperability Gotchas
- Upgrade from 1.8 to 2.x is supported
  - files created with a 1.8 MDT use IGIF
  - new files use the new FID scheme
- The 1.8 client understands the new FID format
  - 1.8 clients can talk to 2.x servers
- The 2.x client does not understand the old FID format
  - 2.x clients cannot talk to 1.8 servers
  - servers must be upgraded first
- Upgrade from 1.8 to 2.x with active clients is NOT possible
  - all 1.8 clients are evicted during the upgrade

Configuration Parameter Glitches
- Procfs & binary paths of the group upcall have changed
  - from mdt.group_upcall=/path/to/l_getgroups
  - to mdt.identity_upcall=/path/to/l_getidentity
  - upgraded filesystems have the former in the configuration log
  - a warning message is printed when trying to set mdt.group_upcall on 2.x, but the mount is still successful
  - upgraded filesystems use NONE by default; this can be fixed by running on the MGS:

      lctl conf_param $FSNAME-MDT0000.mdt.identity_upcall=/path/to/l_getidentity

  - the old param can be removed with lctl conf_param -d
- mdt.quota_type is now mdd.quota_type
  - patch available in LU-110

Ext4 Changes in 2.1
- Flexible block group (flex_bg) used by default
  - co-locate group bitmaps and inode tables to provide larger contiguous free spaces
  - avoid costly seeks for both data/metadata
- 128TB OST support tested & validated
  - e2fsck is fast thanks to flex_bg
  - full e2fsck of a 128TB OST with 32k 4GB files takes 32 mins
- Support objects larger than 2TB
  - ext3 limits the size of an individual object to 2TB
  - limit also hardcoded in the lustre client
  - ext4 reports the new limit (16TB) in the superblock and lustre clients fetch this parameter at connect time
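The sizes quoted above are easy to sanity-check. The sketch below reads "32k" as 32 x 1024 and uses binary units; the explanations attached to the 2TB and 16TB limits (a 32-bit count of 512-byte sectors, and 32-bit block numbers with 4 KiB blocks) are the commonly cited reasons and are assumptions here, since the slides do not state them.

```python
KIB, GIB, TIB = 2**10, 2**30, 2**40

# 32k files of 4 GiB each fill the 128 TiB OST used for the e2fsck test.
fsck_dataset = 32 * KIB * 4 * GIB
print(fsck_dataset // TIB, "TiB")

# Assumed origin of the ext3 limit: 32-bit count of 512-byte sectors.
ext3_limit = 2**32 * 512
print(ext3_limit // TIB, "TiB")

# Assumed origin of the ext4 limit: 32-bit block numbers, 4 KiB blocks.
ext4_limit = 2**32 * 4096
print(ext4_limit // TIB, "TiB")
```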

Thank you
Johann Lombardi, Principal Engineer, Whamcloud, Inc.
2010-2011 Whamcloud, Inc.