Lustre 2.12 and Beyond. Andreas Dilger, Whamcloud


Upcoming Release Feature Highlights

2.12 is feature complete
- LNet Multi-Rail Network Health - improved fault tolerance
- Lazy Size on MDT (LSOM) - efficient MDT-only scanning
- File Level Redundancy (FLR) enhancements - usability and robustness
- T10 Data Integrity Field (DIF) - improved data integrity
- DNE directory restriping - better space balancing and DNE2 adoption

2.13 development and design underway
- Persistent Client Cache (PCC) - store data in client-local NVMe
- File Level Redundancy Phase 2 - Erasure Coding
- DNE auto remote directory/striping - improve load/space balance across MDTs

2.14 plans continued functional and performance improvements
- DNE directory auto-split - improve usability and performance of DNE2
- Client metadata Write Back Cache (WBC) - improve interactive performance, reduce latency

LNet Network Health, UDSP (2.12/2.14)

Builds on LNet Multi-Rail in 2.10/2.11 (LU-9120 Intel/WC, HPE/SGI)
- Detect network interface and router failures automatically
- Handle LNet faults without lengthy Lustre recovery, optimize resend path
- Handle multi-interface router failures

User Defined Selection Policy (LU-9121 WC, HPE)
- Fine-grained control of interface selection
- Optimize RAM/CPU/PCI data transfers
- Useful for large NUMA machines
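
Since Network Health builds on Multi-Rail, a minimal configuration sketch may help; it assumes two InfiniBand interfaces (ib0/ib1) and the o2ib network name, both illustrative:

    lnetctl lnet configure
    lnetctl net add --net o2ib --if ib0
    lnetctl net add --net o2ib --if ib1     # second rail on the same LNet network
    lnetctl net show --verbose              # verify both interfaces are configured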

Data-on-MDT Improvements (LU-10176 WC 2.12+)

- Complementary with DNE2 striped directories - scale small-file IOPS with multiple MDTs
- Read-on-Open fetches data (LU-10181)
  - Reduced RPCs for common workloads
  - Allow cross-file readahead for small files
- Improved locking for DoM files (LU-10175)
  - Drop layout lock bit without full IBITS lock cancellation
  - Avoid cache flush and extra RPCs
  - Convert write locks to read locks
- Migrate file/component from MDT to OST (LU-10177)
- Migrate file/component from OST to MDT via FLR (LU-11421)
- Cross-file data prefetch via statahead (LU-10280)

[Diagram: client open/read returns getattr, layout, lock, size and read data in one RPC, so small files are read directly from the MDS]
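
A hedged sketch of creating a DoM layout so that small files live on the MDT; the 1MiB DoM component size and directory path are illustrative:

    # files under this directory keep their first 1MiB on the MDT,
    # with a normal OST component beyond that
    lfs setstripe -E 1M -L mdt -E -1 -c 1 /mnt/lustre/smallfiles
    lfs getstripe /mnt/lustre/smallfiles    # confirm the composite layout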

DNE Improvements (WC 2.12/2.13)

- Directory restriping from single-MDT to striped/sharded directories (LU-4684)
  - Rebalance MDT space usage, improve large directory performance
- Automatically create new remote directory on "best" MDT with mkdir()
  - Simplifies use of multiple MDTs without striping all directories, similar to OST usage
  - In 2.11 in userspace for "lfs mkdir -i -1" (LU-10277)
  - In 2.13 in kernel for mkdir() with parent stripe_idx = -1 (LU-10784, LU-11213)
- Automatic directory restriping to avoid explicit striping at create (LU-11025)
  - Create a one-stripe directory for low overhead, scale shards/capacity/performance with size
  - Add extra shards when the master directory grows large enough (e.g. 10k entries)
  - Move existing direntries to new directory shards, keep existing inodes in place
  - New direntries and inodes created in the new shards

[Diagram: master directory growing from +4 to +12 directory shards]
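
A brief sketch of the DNE directory commands referenced above; directory paths and the shard count are illustrative:

    lfs mkdir -i -1 /mnt/lustre/newdir    # let Lustre choose the "best" MDT (2.11+, LU-10277)
    lfs mkdir -c 4 /mnt/lustre/bigdir     # shard a directory across 4 MDTs
    lfs getdirstripe /mnt/lustre/bigdir   # show which MDTs hold the shards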

ZFS Enhancements Related to Lustre (2.12+)

- Lustre 2.12 osd-zfs updated to use ZFS 0.7.9
  - Bugs in ZFS 0.7.7/0.7.10, so those releases are not used by the Lustre branch
  - Working on building against upstream ZoL RPMs
- Features in the ZFS 0.8.x release (target 2018Q4)
  - Depends on the final ZFS release date, to avoid disk format changes
  - Sequential scrub/resilver (Nexenta)
  - On-disk data encryption + QAT hardware acceleration (Datto)
  - Project quota accounting (Intel)
  - Device removal via VDEV remapping (Delphix)
  - Metadata Allocation Class (Intel, Delphix)
  - Declustered Parity RAID (dRAID) (Intel)
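
As a hedged illustration of the ZFS 0.8 features listed above (not Lustre-specific commands; pool and dataset names are illustrative):

    # native on-disk encryption on a new dataset
    zfs create -o encryption=aes-256-gcm -o keyformat=passphrase -o keylocation=prompt tank/secure
    # scrubs benefit from the sequential scrub/resilver work
    zpool scrub tank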

Miscellaneous Improvements (2.12)

- Token Bucket Filter (NRS-TBF) UID/GID policy (LU-9658 DDN)
- Improved JobStats allow an admin-formatted JobID (LU-10698 Intel)
  - lctl set_param jobid_var=SLURM_JOB_ID jobid_name=cluster2.%j.%e.%p
- HSM infrastructure improvements and optimizations (Intel/WC, Cray)
  - Improved Coordinator (LU-10699), POSIX copytool (LU-11379), more than 32 archives (LU-10114), ...
- Lazy Size-on-MDT for local scans (purge, HSM, policy engine) (LU-9358 DDN)
  - LSOM is not guaranteed to be accurate, but is good enough for many tools
  - LSOM made available on the client for applications aware of its limitations (e.g. lfs find, statx(), ...)
- Lustre-integrated T10-PI end-to-end data checksums (LU-10472 DDN)
  - Pass data checksums between client and OSS, avoid overhead, integrate with hardware
- Dump/restore of conf_param/set_param -P parameters (LU-4939 Cray)
  - lctl --device MGS llog_print testfs-client > testfs.cfg; lctl set_param -F testfs.cfg
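
A hedged sketch combining the JobStats setup above with the new TBF UID policy; the jobstats lines come from the slide, while the TBF policy/rule syntax follows the Lustre manual and should be verified against the installed release (rule name, UID, and rate are illustrative):

    lctl set_param jobid_var=SLURM_JOB_ID jobid_name=cluster2.%j.%e.%p
    lctl get_param obdfilter.*.job_stats      # per-JobID I/O statistics on an OSS
    lctl set_param ost.OSS.ost_io.nrs_policies="tbf uid"
    lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start slow_user uid={500} rate=100"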

Improved Client Efficiency (2.12+)

- Disconnect idle clients from OSS (LU-7236 Intel)
  - Reduce memory usage on client and server for large systems
  - Reduce network pings and recovery times
- Aggregate statfs() RPCs on the MDS (LU-10018 WC)
  - Reduce OSS overhead, avoid idle OST reconnection on client "df"
- Improved client read performance (LU-8709 DDN)
  - Improved readahead code, asynchronous I/O submission
- Reduce wakeups and background tasks on idle clients (LU-9660 Intel)
  - Synchronize wakeups between threads/clients (per JobID?) to minimize jitter
  - Still need to avoid a DoS of the server if all clients ping/reconnect at the same time
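
A hedged sketch of the idle-disconnect tunable; the osc.*.idle_timeout parameter name is taken from the LU-7236 work in 2.12 and the value is illustrative, so verify it on your release:

    lctl get_param osc.*.idle_timeout       # seconds before an idle OSC disconnects
    lctl set_param osc.*.idle_timeout=30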

Kernel Code Cleanup (ORNL, SuSE, Intel, WC, Cray)

- Lustre client removed from kernel 4.17
  - Work continuing on client cleanups for upstream at https://github.com/neilbrown/linux
- Lustre 2.12 updates for kernel 4.14/4.15 (LU-10560, LU-10805)
- Improve kernel time handling (Y2038, jiffies) (LU-9019)
- Ongoing /proc -> /sys migration and cleanup (LU-8066)
  - Handled transparently by lctl / llapi_* - please use them
- Cleanup of wait_event, cfs_hash_*, and many more internals
- Some builds/testing with ARM64/Power8 clients (LU-10157 - little-endian!)
- Major ldiskfs features merged into upstream ext4/e2fsprogs
  - Large xattrs (ea_inode), directories with more than 10M entries (large_dir)
  - dirdata feature not yet merged (needs a test interface)
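
Because tunables are moving from /proc to /sys, scripts should go through lctl rather than hard-coded paths; a small sketch with illustrative parameter values:

    lctl get_param llite.*.max_cached_mb          # instead of cat /proc/fs/lustre/llite/...
    lctl set_param llite.*.max_read_ahead_mb=256  # works regardless of /proc vs /sys location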

Performance Improvements for Flash (WC 2.12+)

- Reduce server CPU overhead to improve small flash IOPS
  - Performance is primarily CPU-limited for small reads/writes
  - Any reduction in CPU usage directly translates into improved IOPS
- Reduce needless lu_env operations in the DLM (LU-11164)
- Avoid page cache on flash OSS (LU-11347)
  - Avoids CPU overhead/lock contention for page eviction
  - Streaming flash performance is often network-limited
- ZFS drops pages from cache after use (LU-11282)
- Improved efficiency of the ZFS I/O pipeline
  - Integrate with ABD in ZFS 0.8 to avoid memcpy() of data
- Further improvements with continued investigation/development
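
The LU-11347 work changes osd-ldiskfs caching behaviour internally; as a rough manual analogue today (an assumption, not part of that patch), the existing OSS cache tunables can be disabled on flash-backed OSTs:

    lctl set_param obdfilter.*.read_cache_enable=0
    lctl set_param obdfilter.*.writethrough_cache_enable=0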

FLR Enhancements (WC 2.12+)

Continuation of the FLR feature landed in Lustre 2.11 (LU-9771)
- FLR-aware OST object allocator to avoid replicas on the same OST/OSS (LU-9007)
- Improve "lfs mirror resync" performance (LU-10916)
  - Optimize multi-mirror resync (read data once, write multiple mirrors)
- "lfs mirror read" to dump a specific mirror/version of a file (LU-11245)
- "lfs mirror write" for script-based resync (LU-10258)
- Mirror NOSYNC flag + timestamp to allow file versions/snapshots (LU-11400)
- Improved replica selection at runtime (LU-10158)
  - Select best write replica (PREFERRED, SSD vs. HDD, near to client) and read replica (many mirrors vs. few)
  - Allow specifying fault domains for OSTs (e.g. rack, PSU, network switch, etc.)
- Mount client directly on OSS for improved resync performance (LU-10191)
- Support DoM components (LU-10112)
- Pool or OST/MDT quotas (LU-11023)
  - Track/restrict space usage on flash OSTs/MDTs

[Diagram: Replica 0 = Object j (PRIMARY, PREFERRED); Replica 1 = Object k (STALE), caught up by delayed resync]
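
A hedged sketch of the lfs mirror commands mentioned above; pool names, the mirror ID, and file paths are illustrative, and the "lfs mirror read" option syntax should be checked against the 2.12 man page:

    lfs mirror create -N -p flash -N -p capacity /mnt/lustre/data.0   # two mirrors on different pools
    lfs mirror resync /mnt/lustre/data.0                              # bring stale mirrors up to date
    lfs mirror read -N 2 -o /tmp/data.copy /mnt/lustre/data.0         # dump the second mirror (LU-11245)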

Tiered Storage with FLR Layouts (2.12/2.13)

- Integration with job scheduler and workflow for file prestage/drain/archive
- Policy engine (driven by ChangeLogs) to manage migration between tiers and rebuild replicas
  - Policies for pathname, user, extension, age, OST pool, mirror copies, ...
  - FLR provides the mechanism for safe migration of (potentially in-use) data
  - RBH or LiPE are good starting points for this
  - Needs userspace integration and Lustre hooks
- Integrated burst buffers are a natural starting point
  - Mirror to flash, mark PREFERRED for read/write
  - Resync modified files off flash, release space
  - Need OST/MDT/pool quotas to manage tiers

[Diagram: Lustre clients (~50,000+) with NVMe burst buffer/hot tier OSTs and NVMe MDTs on the client network; metadata servers (~100s) with the MGT and MDTs; object storage servers (1000s) with warm tier SAS/SSD OSTs; archive OSTs/tape as a cold tier with erasure coding; a policy engine drives migration between tiers]
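
A hedged sketch of the burst-buffer flow described above (mirror to flash, prefer it, resync off later); the pool name is illustrative and the component ID must be taken from "lfs getstripe" on the real file:

    lfs mirror extend -N -p flash /mnt/lustre/input.dat        # add a mirror on the flash pool
    lfs getstripe /mnt/lustre/input.dat                        # find the flash component's ID
    lfs setstripe --comp-set -I 131073 --comp-flags=prefer /mnt/lustre/input.dat
    lfs mirror resync /mnt/lustre/input.dat                    # push modifications back off flash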

FLR Erasure Coded Files (LU-10911 WC 2.13+)

- Erasure coding adds redundancy without 2x/3x space overhead
- Add erasure coding to new/old striped files after writes are done
  - Use delayed/immediate mirroring for files being actively modified
- For striped files, add N parity stripes per M data stripes (e.g. 16d+3p)
- Parity declustering avoids I/O bottlenecks and the CPU overhead of too many parities
  - e.g. split a 128-stripe file into 8x (16 data + 3 parity) groups, giving 24 parity stripes in total

[Diagram: layout example with data stripes dat0..dat15 protected by parities par0..par2, dat16..dat31 by par3..par5, and so on, the p/q/r parity blocks rotating across 1MB stripe units]

Persistent Client Cache (PCC) (LU-10092 DDN, WC 2.13+)

- Reduce latency, improve small/unaligned IOPS, reduce network traffic
- PCC integrates Lustre with persistent per-client local cache devices
  - Each client has its own cache (SSD/NVMe/NVRAM) as a local filesystem (e.g. ext4/ldiskfs)
  - No global/visible namespace is provided by PCC; data is local to the client only
- Existing files are pulled into PCC by the HSM copytool per user directive, job script, or policy
- New files created in PCC are also created on the Lustre MDS
- Kernel uses the local file if it is in cache, otherwise normal Lustre I/O
  - Further file read/write access goes directly to the local data
  - No data/IOPS/attributes leave the client while the file is in PCC
- File is migrated out of PCC via HSM upon remote access
- Separate functionality for read vs. write file cache
- Could later integrate with DAX for NVRAM storage

[Diagram: each client's llite layer has a PCC switcher that routes I/O either to the local NVMe cache, populated by an HSM copytool fetch, or through the normal OSC path to the OSTs]
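
For orientation only, a sketch of the PCC interface roughly as it later appeared in Lustre 2.13; at the time of this talk the commands were still under development, and the cache path, project ID, and archive ID are illustrative:

    lctl pcc add /mnt/lustre /mnt/pcc --param "projid={500} rwid=2"   # register a client-local cache
    lfs pcc attach -i 2 /mnt/lustre/job/output.dat                    # pull a file into the cache
    lfs pcc state /mnt/lustre/job/output.dat
    lfs pcc detach /mnt/lustre/job/output.dat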

Client Metadata Writeback Cache (WBC) (LU-10983 2.14+)

- Metadata WBC creates new files in client RAM in a new directory
  - Avoid RPC round-trips for each open/create/close
  - Lock the directory exclusively, avoid other DLM locking
  - Cache file data only in the pagecache until flush
- Flush the tree incrementally to MDT/OST in background batches
- Could prefetch directory contents for an existing directory
- Can integrate with PCC to avoid the initial MDS create
- Early WBC prototype in progress
  - Discussions underway on how to productize it
  - Early results show 10-20x improvement for some workloads

Client Container Image (CCI) (2.14+)

- Filesystem images have had ad hoc use with Lustre in the past
  - Read-only cache of many small files manually mounted on clients
  - Root filesystem images for diskless clients
- A Container Image is a local ldiskfs image mounted on the client
  - Holds a whole directory tree stored as a single Lustre file
- CCI integrates container handling with Lustre
  - Mountpoint is registered with Lustre for automatic mount
  - Automatic local loopback mount of the image at the client upon access
  - Image file read on demand from the OST(s) and/or cached in PCC
- Low I/O overhead, few file extent locks, high IOPS per client
  - Access, migrate, and replicate an image with large reads/writes to the OST(s)
  - MDS can mount and re-export image files for shared use
- CCI can hold an archive of a whole directory tree for HSM
  - Unregister/delete an image and all files therein with a few operations

DNE Metadata Redundancy (2.14+)

- New directory layout hash for mirrored directories, mirrored MDT inodes
  - Each dirent copy holds multiple MDT FIDs for the inode
  - Store a dirent copy on each mirror directory shard
  - Name lookup on any MDT can access the inode via any FID
  - Copies of mirrored inodes are stored on different MDTs
- DNE2 distributed transactions for update recovery
  - Ensure that copies stay in sync on multiple MDTs
- Mirror policy is flexible, per filesystem or subtree
- Flexible MDT space/load balancing with striped directories
- Early design work started, discussions ongoing

[Diagram: the dirent "foo" is stored in directory shards on MDT0002 and MDT0005; each copy holds FIDs 2:1 and 5:2 for the mirrored inode copies, which carry matching LOV layouts]