CLIO. Nikita Danilov, Senior Staff Engineer, Lustre Group


Problems with the old client IO path
Old code with many obscurities;
inter-layer assumptions (lov, llite);
based on a huge obd interface;
not easily extensible;
traditionally many bugs;
high overhead: allocations, kms;
not very portable;
different from the MD server stack;
interaction between layers is unclear.

Client IO path clean-up requirements
Reduce the number of bugs in the IO path.
Build an infrastructure for planned features.
Clean up old code and interfaces.
Specify precisely:
> state machines;
> interactions;
> pre-, post-conditions and invariants.
Don't lose performance.

CLIO design goals
clear layering;
controlled state sharing;
simplified layer interface;
real stacking;
support for:
> SNS;
> read-only p2p caching;
> lock-less IO and OST intents;
reuse of the MDT server layering;
improved portability.

CLIO restrictions
Not a complete re-write: only an interface clean-up.
No new features, only new interfaces.
Only the data IO path; meta-data are left intact.
No changes to the DLM.
No changes to the recovery.
No changes to the RPC formation logic.

Layers
vvp: VFS and VM, converting system calls (read, write) into the client API; POSIX operation semantics (short reads, flock, O_APPEND), read-ahead. This is mostly the old llite.
lov: raid0 striping: files/stripes, locks/sub-locks.
sns: raid with parity declustering, based on the raid-frame library (not implemented).
osc: ptlrpc/lnet, DLM, RPC formation.

Layers (diagram)
User-space calls (sys_read(fd, buf, count), sys_write(fd, buf, count), fstat(fd, &statbuf)) enter the stack either through vm-vfs-posix (vvp, on Linux) or through liblustre (on Solaris), converge in libclient (ccc), and pass through lov down to the osc's and their RPC queues; read-ahead widens the [pos, pos + count] window.

Fundamental data-types
cl_object: a file and a stripe; based on lu_object (fids, hashing, lru); caches pages and tracks locks; can be composed of other objects.
cl_page: a fixed-size chunk of file data; can be part of multiple cl_objects (a file, and a stripe of this file); uniquely identified by (object, index).
cl_lock: a region of a particular object, covered by a cluster lock.
cl_io: an IO context; contains IO state; can be stopped and resumed; collects locks and owns pages.
cl_req: an RPC.
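A minimal sketch of how these types relate: a cl_page is identified by its (object, index) pair, and a cl_lock covers an extent of one object. The real Lustre structures are far richer; the field names below are only suggestive of the naming convention, not the actual definitions.

```c
#include <assert.h>

struct cl_object { int co_id; };       /* a file or one stripe of it */

struct cl_page {                       /* fixed-size chunk of file data */
        struct cl_object *cp_obj;      /* owning object */
        unsigned long     cp_index;    /* page index within the object */
};

struct cl_lock {                       /* extent lock on object data */
        struct cl_object *cll_obj;
        unsigned long     cll_start;   /* first covered page index */
        unsigned long     cll_end;     /* last covered page index */
};

/* A page is covered iff it belongs to the same object and its index
 * falls inside the lock's extent. */
static int cl_lock_covers(const struct cl_lock *lock,
                          const struct cl_page *pg)
{
        return lock->cll_obj == pg->cp_obj &&
               lock->cll_start <= pg->cp_index &&
               pg->cp_index <= lock->cll_end;
}
```

The (object, index) identity is what lets a single physical page be reached both as part of a file and as part of one of that file's stripes.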

Layered objects
Compound object: a header and a sequence of layers.
Layer-private state.
Per-layer operation vectors.
Generic code invokes operations on each layer in turn, delegating behavior: the header points at slices for lev1, lev2, lev3, each with its own data and its own implementation of op0(), op1(), and so on.
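The delegation scheme above can be sketched as follows. The names (slice, op0, vvp_op0, lov_op0) are illustrative, not the real code; the point is that generic code walks the slice array and calls through each layer's operation vector.

```c
#include <assert.h>

#define MAX_SLICES 4

struct slice;
struct slice_ops {
        int (*op0)(struct slice *s);   /* per-layer implementation of op0 */
};
struct slice {
        const struct slice_ops *ops;
        int private_state;             /* layer-private data */
};
struct header {                        /* compound object: header + layers */
        int          nr;
        struct slice slices[MAX_SLICES];
};

/* Generic code: invoke op0 on every layer, top to bottom, stop on error. */
static int obj_op0(struct header *h)
{
        for (int i = 0; i < h->nr; i++) {
                int rc = h->slices[i].ops->op0(&h->slices[i]);
                if (rc != 0)
                        return rc;
        }
        return 0;
}

/* Two example layers: each just records that it was called. */
static int vvp_op0(struct slice *s) { s->private_state = 1; return 0; }
static int lov_op0(struct slice *s) { s->private_state = 2; return 0; }
static const struct slice_ops vvp_ops = { vvp_op0 };
static const struct slice_ops lov_ops = { lov_op0 };
```

This is the pattern every CLIO type (cl_object, cl_page, cl_lock, cl_io) follows: one generic header, one slice per layer, one operation vector per slice.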

cl_page
caches file data;
supports different page sizes (DMU blocks);
interacts with MM/VM (page locking, memory pressure);
simple locking model, suitable for Lustre.
A cl_page is a stack of slices: struct page/vvp_page, lov_page, and osc_page (linked into the osc page queues), each a cl_page_slice with methods such as ->cpo_prep() and ->cpo_completion().

cl_page state machine (state diagram)

cl_lock
extent read/write lock on file data;
top-lock on a file, sub-locks on stripes;
based on the DLM.
A cl_lock is a stack of slices: vvp_lock, lov_lock (holding the sub-locks), and osc_lock (wrapping struct ldlm_lock), each a cl_lock_slice with methods such as ->clo_enqueue() and ->clo_wait().

cl_lock state machine
top-locks and sub-locks (N:M);
caching of top-locks (short IO);
non-blocking, asynchronous state machine (parallel IO).
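One consequence of the non-blocking design is that a top-lock cannot simply wait on its sub-locks; instead it is promoted asynchronously as sub-lock grants arrive. A sketch of that idea, with illustrative names and states (not the real cl_lock state set):

```c
#include <assert.h>

#define MAX_SUBLOCKS 8

enum lock_state { LS_NEW, LS_QUEUING, LS_HELD };

struct top_lock {
        int             nr_sub;                /* sub-locks on stripes */
        enum lock_state sub[MAX_SUBLOCKS];
        enum lock_state state;                 /* top-lock state */
};

/* Called when one sub-lock is granted; promotes the top-lock to HELD
 * once all sub-locks are held. Returns immediately, never blocks. */
static void sub_lock_granted(struct top_lock *top, int idx)
{
        top->sub[idx] = LS_HELD;
        for (int i = 0; i < top->nr_sub; i++)
                if (top->sub[i] != LS_HELD)
                        return;                /* still waiting on others */
        top->state = LS_HELD;
}
```

Because grants can arrive in any order, the same callback-driven promotion works for parallel IO, where many enqueues are outstanding at once.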

cl_object
top-object: a file; sub-objects: stripes;
uniquely identified by a fid;
caches an LRU tree of pages and a list of locks;
trivial state machine.
A cl_object is a stack of slices: inode/vvp_object, lov_object (holding the sub-objects), and osc_object (wrapping struct lov_oinfo), each a cl_object_slice with methods such as ->coo_attr_get() and ->coo_attr_set().

Files, locks, pages (diagram).

cl_io
high-level IO context for an IO activity;
fast: no dynamic allocation;
state machine designed for parallel IO.
A cl_io is a stack of slices: vvp_io, lov_io (holding the sub-IOs), and osc_io, each with methods such as ->cio_lock() and ->cio_start().

cl_io state machine
init: initialize the layers;
prep: form an iteration;
lock: collect locks;
start: do the IO;
end: wait for completion.
The states form an IO loop: INIT, then repeated PREP, LOCK, START, END, UNLOCK iterations until DONE.
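The iteration loop above can be sketched as a small driver: the IO context cycles PREP through UNLOCK, consuming one chunk of the request per pass, until nothing remains. The enum values and struct are illustrative, not the actual cl_io definitions.

```c
#include <assert.h>

enum cio_state { CIS_INIT, CIS_PREP, CIS_LOCK, CIS_START, CIS_END,
                 CIS_UNLOCK, CIS_DONE };

struct cl_io_sketch {
        enum cio_state state;
        long           remaining;   /* bytes still to transfer */
        long           chunk;       /* bytes handled per iteration */
        int            iterations;  /* loop passes executed */
};

/* Drive the state machine until the whole request is consumed. */
static void cl_io_loop_sketch(struct cl_io_sketch *io)
{
        io->state = CIS_INIT;                    /* initialize the layers */
        while (io->remaining > 0) {
                io->state = CIS_PREP;            /* clip to one iteration */
                io->state = CIS_LOCK;            /* collect and sort locks */
                io->state = CIS_START;           /* submit the transfer */
                io->remaining -= io->remaining < io->chunk ?
                                 io->remaining : io->chunk;
                io->state = CIS_END;             /* wait for completion */
                io->state = CIS_UNLOCK;          /* release the locks */
                io->iterations++;
        }
        io->state = CIS_DONE;
}
```

In the real stack each PREP pass is where lov clips the IO to a single stripe, which is why the loop, not a single pass, covers a multi-stripe request.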

Use cases: file read
sys_read(fd, buf, count);
vvp: cl_io_init();
cl_io_prep():
> lov_io_pre(): clip the IO to one stripe;
cl_io_lock():
> vvp_io_lock(): lock [pos, pos + count - 1];
> sns_io_lock(): lock the parity block;
> generic code: sort the locks, enqueue;
cl_io_start():
> vvp_io_start(): generic_file_read();
vvp_readpage(): read-ahead, cl_submit_io():
> sns_submit_io(): parity ordering;
> osc_submit_io(): send the RPC.

Use cases: lock cancellation
An osc lock receives a network cancel request:
> find the object;
> find the subtree of pages covered by the lock;
> call cancel on each page, up the stack to llite.
Much simpler than the existing logic. In the layered picture, the cancellation travels from osc_lock/osc_object through the lov and vvp slices of the cl_lock, cl_object, and cl_page stacks.
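The per-page walk described above can be sketched as: given the cancelled lock's extent, visit every cached page of the object inside that extent and invoke the cancel callback on each. The structure and names are hypothetical; the real code walks the object's radix tree of cached pages.

```c
#include <assert.h>

#define MAX_PAGES 64

struct obj_cache {
        int cancelled[MAX_PAGES];   /* 1 if the page index was cancelled */
};

/* Invoke "cancel" (here: mark) on each cached page in [start, end];
 * returns the number of pages visited. */
static int cancel_pages(struct obj_cache *obj,
                        unsigned start, unsigned end)
{
        int n = 0;
        for (unsigned i = start; i <= end && i < MAX_PAGES; i++) {
                obj->cancelled[i] = 1;   /* per-page cancel callback */
                n++;
        }
        return n;
}
```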

Clio, Greek Κλειώ, /'klaiou/, n.: the muse of history, also known as the Proclaimer. The name is from the root κλέω, meaning "recount" or "make famous".

Nikita.Danilov@sun.com