An Exploration of New Hardware Features for Lustre. Nathan Rutman

Motivation
- Open-source
- Hardware-agnostic
- Linux
- Least-common-denominator hardware

Contents
- Hardware CRC
- MDRAID
- T10 DIF and end-to-end data integrity
- Flash drive use
- Hybrid volumes
- HA and faster failover

Hardware CRC
- Nehalem: CRC-32C instruction (BZ23549, LU-241); see the sketch below
- Shuichi Ihara's graphs from BZ23549
- Intel Westmere, AMD Bulldozer (2011): PCLMULQDQ (64-bit carryless multiply) can speed up CRC32 and Adler checksums
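
As an illustration (not from the slides), a minimal sketch of CRC-32C computed with the SSE4.2 CRC32 instruction introduced with Nehalem. It assumes an x86-64 compiler building with -msse4.2 and processes the buffer eight bytes at a time:

    #include <nmmintrin.h>   /* _mm_crc32_u64 / _mm_crc32_u8 (SSE4.2) */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    uint32_t crc32c_hw(uint32_t crc, const unsigned char *buf, size_t len)
    {
        crc = ~crc;                        /* CRC-32C runs on the inverted state */
        while (len >= 8) {
            uint64_t chunk;
            memcpy(&chunk, buf, sizeof(chunk));          /* unaligned-safe load  */
            crc = (uint32_t)_mm_crc32_u64(crc, chunk);   /* one CRC32 instruction */
            buf += 8;
            len -= 8;
        }
        while (len--)                      /* remaining tail bytes, one at a time */
            crc = _mm_crc32_u8(crc, *buf++);
        return ~crc;
    }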

MDRAID
- Ongoing improvements in Linux SW RAID
  - Hardening
  - Zero-copy writes
  - Performance: RAID 6, 6E, 10, etc.
- Still to do
  - Zero-copy reads
  - PDRAID
  - Hardware parity math acceleration (see the sketch below)
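
For illustration only (this is not MDRAID source code), the parity math that the last item would offload: RAID-5 parity is a byte-wise XOR across the data members of a stripe, and RAID-6 adds a second, Galois-field syndrome on top of this.

    #include <stddef.h>

    /* XOR parity over one stripe chunk: parity[i] = data[0][i] ^ ... ^ data[n-1][i]. */
    static void xor_parity(unsigned char *parity,
                           unsigned char *const data[], int ndisks,
                           size_t chunk_bytes)
    {
        for (size_t i = 0; i < chunk_bytes; i++) {
            unsigned char p = 0;
            for (int d = 0; d < ndisks; d++)
                p ^= data[d][i];
            parity[i] = p;
        }
    }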

T10 DIF and End-to-End Data Integrity

T10 DIF (figure)
Each 512-byte data sector carries 8 bytes of protection information (PI) appended at offset 512, giving a 520-byte sector: a 16-bit guard tag (GRD, a CRC of the 512-byte data portion) at bytes 512-513, a 16-bit application tag (APP) at bytes 514-515, and a 32-bit reference tag (REF) at bytes 516-519. With DIF the HBA generates the PI: the application and OS hand a plain byte stream of 512-byte sectors to the I/O controller, which sends 520-byte sectors across the SAN to the disk array and disk drive, protected by a transport CRC in flight and a sector CRC on the drive.
Figures from "I/O Controller Data Integrity Extensions", Martin K. Petersen, Oracle Linux Engineering.
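
For reference, the 8-byte PI field described above can be written as a struct like the following sketch (an assumption drawn from the layout in the figure, not code from the deck; the Linux kernel defines a similar structure, and all fields are big-endian on the wire):

    #include <stdint.h>

    /* 8 bytes of protection information appended to each 512-byte sector. */
    struct t10_pi_tuple {
        uint16_t guard_tag;   /* GRD: CRC-16 of the 512 data bytes          */
        uint16_t app_tag;     /* APP: opaque, owned by the application      */
        uint32_t ref_tag;     /* REF: typically the low 32 bits of the LBA  */
    };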

T10 DIX (figure)
With DIX the user application or OS generates the PI instead: the application passes a byte stream to the OS, the OS submits 512-byte data sectors together with separate PI buffers to the I/O controller, and the HBA merges the data and PI scatterlists into 520-byte sectors that travel across the SAN to the disk array and disk drive (transport CRC and sector CRC as before).

T10 End-to-End Data Integrity

Write path:
1. Userspace computes GRD per sector (optional; the guard CRC is sketched below)
2. Client recomputes GRD after the kernel copy
3. Client adds GRD to the BIO bulk descriptor
4. OST pulls the BIO
5. OST passes the PI to ldiskfs
6. MDRAID maps REF tags
7. HBA takes SG lists of data and PI
8. HBA recalculates GRD (retries from the client)
9. Disk verifies REF and GRD (retries from the HBA)

Read path:
1. Client sets up PI and data buffers, sends the BIO
2. OST sets up PI and data buffers
3. MDRAID requests data and parity blocks
4. Disk verifies REF and GRD
5. HBA maps data and PI to buffers
6. MDRAID verifies parity, reconstructs corrupted data
7. OST sends the data
8. Client verifies GRD
9. Userspace re-verifies GRD after the kernel copy (optional)
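
To make the GRD steps concrete, here is a minimal, unoptimized sketch of the guard-tag CRC (an assumption for illustration, not the deck's code): T10 DIF specifies a CRC-16 with polynomial 0x8BB7 over each 512-byte sector.

    #include <stddef.h>
    #include <stdint.h>

    /* Bit-at-a-time CRC-16 with the T10 DIF polynomial 0x8BB7,
     * initial value 0, no reflection, no final XOR. */
    static uint16_t crc16_t10dif(const unsigned char *data, size_t len)
    {
        uint16_t crc = 0;

        for (size_t i = 0; i < len; i++) {
            crc ^= (uint16_t)data[i] << 8;
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
                                     : (uint16_t)(crc << 1);
        }
        return crc;
    }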

Flash Drives

Flash Drives
SSDs are fast, but not large. Candidate uses:
- Caching
- Lustre persistent data files: last_rcvd, last_objid, open_write
- Metadata journals, WIBs
- Lustre FS metadata
- Data: small files

EXT4 Hybrid Volumes

EXT4 Hybrid Volumes
Problems today:
- Metadata is distributed around the disk, breaking up large disk chunks and slowing fsck
- Small files are treated the same as large files (e.g. same RAID level)
Hybrid volume:
- A single filesystem spanning a group of local devices with different RAID striping, speeds, hardware, or access patterns
- Loss of a single device would kill the FS, so each device must be safe (RAID)
Change the allocator (a policy sketch follows below):
- Put metadata together, all on fast RAID1
- Put small files after this, also on RAID1
- Put large files aligned on 1MB boundaries on RAID6
- The EXT4 online defragmenter can migrate them
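
A hypothetical sketch of the placement policy such an allocator would follow; the names (choose_region, SMALL_FILE_LIMIT) and the small-file cutoff are illustrative assumptions, not ext4 code.

    #include <stdbool.h>
    #include <stddef.h>

    enum region {
        REGION_MD_RAID1,      /* all filesystem metadata, kept together   */
        REGION_SMALL_RAID1,   /* small files, right after the metadata    */
        REGION_LARGE_RAID6    /* large files, 1MB-aligned stripes         */
    };

    #define SMALL_FILE_LIMIT (64 * 1024)  /* assumed small-file cutoff */

    static enum region choose_region(bool is_metadata, size_t file_size)
    {
        if (is_metadata)
            return REGION_MD_RAID1;
        if (file_size <= SMALL_FILE_LIMIT)
            return REGION_SMALL_RAID1;
        return REGION_LARGE_RAID6;
    }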

EXT4 Hybrid Volume Layout (figure)
The figure shows block groups BG0-BG3 laid out by content type: a flex_bg holding the group descriptors, inode bitmaps, inode tables, directories, and EAs; a block group holding directories, EAs, and small files; and block groups holding large files. The journal, metadata, small files, and large-file stripes each occupy their own region of the volume, mapped to the appropriate RAID device.

EXT4 Hybrid Volume Advantages
- Larger transfer sizes and reduced seek time for each type
  - No need to skip over data to get to metadata, and vice versa
  - Eliminates seek time between types
  - Leave the data volume's read head waiting at the next 1MB boundary
  - Leave the MD head waiting at the allocator bitmaps
- Avoids the RAID6 parity-recompute penalty for small files
- A smaller metadata volume makes SSD or RAID1 practical
- FSCK times are reduced
  - Location implies content type (no need to read the whole disk for each pass)
  - Ordered metadata is faster to read
- Multiple data volumes
  - Volumes can be put to sleep as needed

HA and Failover
- HW monitoring and Imperative Recovery: don't wait for Lustre timeouts; more important in larger clusters
- Persistent RAM: RAMdisk MDT with a quick flush to SSD on power loss

Thank You