Lustre HPCS Design Overview. Andreas Dilger, Senior Staff Engineer, Lustre Group, Sun Microsystems



Topics
HPCS Goals
HPCS Architectural Improvements
Performance Enhancements
Conclusion

[Figure: HPC Center of the Future - a Capability system (500,000 nodes), Capacity 1 (250,000 nodes), Capacity 2 (150,000 nodes), Capacity 3 (50,000 nodes), Test (25,000 nodes), Viz 1, Viz 2 and WAN access all attached to a 10 TB/sec shared storage network, backed by an HPSS archive, user data on 1000's of OSTs and metadata on 100's of MDTs]

HPCS Goals - Capacity (current Lustre status shown in parentheses)

Filesystem Limits
> 100 PB+ maximum file system size (10 PB)
> 1 trillion (10^12) files per file system (4 billion files)
> More than 30,000 client nodes (> 30k clients)

Single File/Directory Limits
> 10 billion files per directory (15M)
> 0 to 1 PB file size range (360 TB)
> 1 B to 1 GB I/O request size range (1 B - 1 GB I/O requests)
> 100,000 open shared files per process (10,000 files)
> Long file names and 0-length files as data records (long file names and 0-length files)

HPCS Goals - Performance

Aggregate
> 10,000 metadata operations per second (10,000/s)
> 1500 GB/s with file-per-process or shared-file I/O (180 GB/s)
> No impact of RAID rebuilds on performance (variable)

Single Client
> 40,000 creates/s, with up to 64 kB of data each (5k/s, 0 kB)
> 30 GB/s full-duplex streaming I/O (2 GB/s)

Miscellaneous
> POSIX I/O API extensions proposed at OpenGroup* (partial)

* High End Computing Extensions Working Group: http://www.opengroup.org/platform/hecewg/

HPCS Goals - Reliability

End-to-end resiliency, equivalent to T10 DIF (net only)
No impact of RAID rebuild on performance (variable)
Uptime of 99.99% (99%?)
> Downtime < 1 h/year
100 h filesystem integrity check (8 h, partial)
> 1 h of allowed downtime means the check must be online

HPCS Architectural Improvements
Use Available ZFS Functionality
End-to-End Data Integrity
RAID Rebuild Performance Impact
Filesystem Integrity Checking
Clustered Metadata
Recovery Improvements
Performance Enhancements

Use Available ZFS Functionality

Capacity
> Single filesystem 100 TB+ (2^64 LUNs * 2^64 bytes)
> Trillions of files in a single file system (2^48 files)
> Dynamic addition of capacity

Reliability and resilience
> Transaction based, copy-on-write
> Internal data redundancy (double parity, 3 copies)
> End-to-end checksum of all data/metadata
> Online integrity verification and reconstruction

Functionality
> Snapshots, filesets, compression, encryption
> Online incremental backup/restore
> Hybrid storage pools (HDD + SSD)

End-to-End Data Integrity

Current Lustre checksumming
> Detects data corruption over the network
> Ext3/4 does not checksum data on disk

ZFS stores data/metadata checksums
> Fast (Fletcher-4 default, or none)
> Strong (SHA-256)

HPCS Integration
> Integrate Lustre and ZFS checksums
> Avoid recomputing the full checksum on data
> Always overlap checksum coverage
> Use a scalable hash tree method

Hash Tree and Multiple Block Sizes

[Figure: hash tree over multiple block sizes - 4 kB data segments carry leaf hashes, interior hashes combine them up through the 16 kB (minimum page), 64 kB (maximum page) and 128 kB (ZFS block) levels, with the root hash covering the 1024 kB Lustre RPC]

Hash Tree For Non-contiguous Data

LH(x) = hash of data x (leaf)
IH(x) = hash of data x (interior)
x + y = concatenation of x and y

C0 = LH(data segment 0)
C1 = LH(data segment 1)
C4 = LH(data segment 4)
C5 = LH(data segment 5)
C7 = LH(data segment 7)
C01 = IH(C0 + C1)
C45 = IH(C4 + C5)
C457 = IH(C45 + C7)
K = IH(C01 + C457) = ROOT hash
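A minimal sketch of this construction in Python, following the segment numbering above. SHA-256 and the "leaf"/"node" domain-separation prefixes stand in for whichever checksum the real design configures (Fletcher-4 or SHA-256 in ZFS); only the segments that were actually written need to be hashed, and the root hash K covers the whole sparse RPC.

```python
import hashlib

def LH(data: bytes) -> bytes:                # leaf hash over one data segment
    return hashlib.sha256(b"leaf" + data).digest()

def IH(left: bytes, right: bytes) -> bytes:  # interior hash over two children
    return hashlib.sha256(b"node" + left + right).digest()

# A non-contiguous write touching 4 kB segments 0, 1, 4, 5 and 7 only.
segments = {i: bytes([i]) * 4096 for i in (0, 1, 4, 5, 7)}

C = {i: LH(data) for i, data in segments.items()}   # C0, C1, C4, C5, C7
C01  = IH(C[0], C[1])
C45  = IH(C[4], C[5])
C457 = IH(C45, C[7])
K    = IH(C01, C457)    # ROOT hash covering the sparse RPC

print(K.hex())
```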

[Figure: End-to-End Integrity - Client Write]

[Figure: End-to-End Integrity - Server Write]

[Figure: HPC Center of the Future, user data sizing - 115 PB of data on 36720 4 TB disks in RAID-6 8+2, giving 1224 96 TB OSTs of 3 * 32 TB LUNs each; the client systems share the 10 TB/sec storage network with the HPSS archive and metadata on 100's of MDTs]

RAID Failure Rates

115 PB filesystem, 1.5 TB/s
36720 4 TB disks in RAID-6 8+2 LUNs
4 year Mean Time To Failure
> 1 disk fails every hour on average
4 TB disk @ 30 MB/s: 38 hr rebuild
> 30 MB/s is 50%+ of disk bandwidth (seeks)
> May reduce aggregate throughput by 50%+
> 1 disk failure may cost 750 GB/s of aggregate bandwidth
38 hr * 1 disk/hr = 38 disks/OSTs degraded at any time
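The headline numbers follow from a little arithmetic; a quick sketch reproducing them (the 4-year MTTF and 30 MB/s rebuild rate are the slide's assumptions, and rounding accounts for the small differences):

```python
# Back-of-the-envelope check of the failure figures above.
disks        = 36720
mttf_hours   = 4 * 365 * 24          # per-disk mean time to failure
disk_size_mb = 4e6                   # 4 TB disk
rebuild_mb_s = 30

failures_per_hour = disks / mttf_hours                    # ~1.05 per hour
rebuild_hours     = disk_size_mb / rebuild_mb_s / 3600    # ~37 h (slide: ~38)
degraded_osts     = failures_per_hour * rebuild_hours     # ~39 (slide: 38)

print(f"{failures_per_hour:.2f} failures/hour, "
      f"{rebuild_hours:.0f} h rebuild, "
      f"~{degraded_osts:.0f} OSTs degraded at any time")
```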

[Figure: Failure Rates vs. MTTF - 36720 4 TB disks, RAID-6 8+2, 1224 OSTs, 30 MB/s rebuild; curves for disks failed during rebuild, percent of OSTs degraded, days to OST double failure and years to OST triple failure, plotted against per-disk MTTF from 0 to 5 years]

[Figure: Failure Rates vs. RAID disks - 36720 4 TB disks, RAID-6 N+2, 1224 OSTs, 4 year MTTF; the same four curves plotted against the number of disks per RAID-6 stripe, from 0 to 35]

[Figure: Failure Rates vs. Rebuild Speed - 36720 4 TB disks, RAID-6 8+2, 1224 OSTs, 4 year MTTF; the same four curves plotted against rebuild speed from 0 to 100 MB/s]

Lustre-level Rebuild Mitigation

Several related problems
> Avoid global impact from a degraded RAID set
> Avoid load on the rebuilding RAID set

Avoid degraded OSTs for new files (sketched below)
> Little or no load on the degraded RAID set
> Maximizes rebuild performance
> Minimal global performance impact
> 30 disks (3 LUNs) per OST, 1224 OSTs
> 38 of 1224 OSTs = 3% aggregate cost
> Degraded OSTs remain available for existing files
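One way to read "avoid degraded OSTs for new files" is as a filter in the object allocator on the MDS; a hypothetical sketch (the degraded flag and the allocation policy here are illustrative, not the actual Lustre allocator):

```python
import random

def choose_osts(osts, stripe_count):
    """Pick OSTs for a new file, preferring those whose RAID set is healthy.
    Degraded OSTs stay available for existing files; they are only skipped
    for new allocations, as described above."""
    healthy = [o for o in osts if not o["degraded"]]
    pool = healthy if len(healthy) >= stripe_count else osts
    return random.sample(pool, stripe_count)

# ~3% of 1224 OSTs degraded, roughly the slide's steady state.
osts = [{"idx": i, "degraded": (i % 32 == 0)} for i in range(1224)]
print(sorted(o["idx"] for o in choose_osts(osts, stripe_count=4)))
```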

ZFS-level RAID-Z Rebuild

RAID-Z/Z2 is not the same as RAID-5/6
> NEVER does read-modify-write
> Supports arbitrary block size/alignment
> RAID layout is stored in the block pointer

ZFS metadata traversal for RAID rebuild
> Good: only rebuilds used storage (< 80%)
> Good: verifies the checksum of rebuilt data
> Bad: may cause random disk access

RAID-Z Rebuild Improvements

RAID-Z optimized rebuild
> ~3% of storage is metadata
> Scan metadata first, build an ordered list (see the sketch below)
> Data reads become mostly linear
> Bookmark to restart rebuild
> ZFS itself is not tied to RAID-Z

Distributed hot space
> Spread hot-spare rebuild space over all disks
> All disks' bandwidth/IOPS available for normal I/O
> All disks' bandwidth/IOPS available for rebuilding
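A sketch of the optimized rebuild order: traverse the metadata once to collect the block pointers that live on the failed disk, sort them by offset so the rebuild reads become mostly sequential, and keep a bookmark so an interrupted rebuild can resume. The function and field names are illustrative, not ZFS code.

```python
def reconstruct_from_parity(bp):
    # Placeholder: read the surviving disks in this stripe, recompute the
    # missing column and verify the block checksum before returning it.
    return b"\0" * bp["length"]

def write_to_spare(bp, data):
    # Placeholder: write the reconstructed block into distributed hot space.
    pass

def rebuild_failed_disk(block_pointers, failed_disk, start_bookmark=0):
    """block_pointers: every allocated block found by traversing the pool
    metadata (dicts with 'disk', 'offset', 'length').  Only used space is
    rebuilt, in ascending offset order, so data reads are mostly linear."""
    todo = sorted((bp for bp in block_pointers
                   if bp["disk"] == failed_disk
                   and bp["offset"] >= start_bookmark),
                  key=lambda bp: bp["offset"])
    bookmark = start_bookmark
    for bp in todo:
        write_to_spare(bp, reconstruct_from_parity(bp))
        bookmark = bp["offset"] + bp["length"]   # persist to allow restart
    return bookmark

blocks = [{"disk": 3, "offset": o, "length": 128 * 1024} for o in (9e9, 1e6, 5e8)]
print(rebuild_failed_disk(blocks, failed_disk=3))
```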

[Figure: HPC Center of the Future, metadata sizing - 2 PB of metadata on 5120 800 GB SAS/SSD drives in RAID-1, giving 128 MDTs of 16 TB LUNs; user data on 1000's of OSTs, the HPSS archive and the 10 TB/sec shared storage network as before]

Clustered Metadata

100s of metadata servers

Distributed inodes
> Files are normally local to their parent directory
> Subdirectories are often non-local

Split directories
> Directories split by hash of the entry name (sketched below)
> Files striped by offset

Distributed Operation Recovery
> Cross-directory mkdir, rename, link, unlink
> Strictly ordered distributed updates
> Ensures namespace coherency and recovery
> At worst an inode refcount is too high and the inode is leaked
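A minimal sketch of the name-hash split: a directory split over several MDTs routes each entry to a stripe by hashing its name, so creates and lookups for different names spread across the metadata servers. The FNV-1a hash and modulo placement are assumptions for illustration, not the actual on-disk layout.

```python
def fnv1a_64(name: bytes) -> int:
    # FNV-1a stands in for whatever hash the directory split actually uses.
    h = 0xcbf29ce484222325
    for b in name:
        h = ((h ^ b) * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

def mdt_for_name(name: str, dir_stripe_count: int, first_mdt: int = 0) -> int:
    """Map a filename within a split directory to one of its MDT stripes."""
    return first_mdt + fnv1a_64(name.encode()) % dir_stripe_count

for n in ("checkpoint.0001", "checkpoint.0002", "restart.h5"):
    print(n, "-> MDT%04d" % mdt_for_name(n, dir_stripe_count=8))
```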

Filesystem Integrity Checking

The problem is among the hardest to solve
> 1 trillion files in 100 h
> 2-4 PB of MDT filesystem metadata
> ~3 million files/sec, 3 GB/s+ for one pass
> 3M*stripes checks/sec from MDSes to OSSes
> 860*stripes random metadata IOPS on OSTs

Need to handle CMD coherency as well
> Link count on files and directories
> Directory parent/child relationship
> Filename to FID to inode mapping
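The per-second rates above follow directly from the goals; a quick check of the arithmetic (the stripe count here is just an example):

```python
files    = 1e12               # HPCS goal: one trillion files
budget_s = 100 * 3600         # 100 hour integrity-check window
stripes  = 4                  # example stripe count; checks fan out per stripe

files_per_sec = files / budget_s
print(f"{files_per_sec / 1e6:.2f}M files/s to verify")          # ~2.78M
print(f"{files_per_sec * stripes / 1e6:.1f}M object checks/s")  # MDS -> OSS
```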

Filesystem Integrity Checking

Integrate Lustre with ZFS scrub/rebuild
> ZFS callback to check Lustre references
> Event-driven checks mean fewer re-reads
> ldiskfs can use an inode table iteration

Back-pointers to allow direct verification
> Pointer from OST object to MDT inode
> Pointer list from inode to {parent dir, name}
> No saved state needed for the coherency check

About 1 bit/block to detect leaks/orphans
> Or a second pass in the reverse direction

Recovery Improvements

Version Based Recovery
> Independent recovery stream per file
> Isolates the recovery domain to dependent operations

Commit on Share
> Avoid clients holding any dependent state
> Avoid sync for single-client operations
> Avoid sync for independent operations
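A hedged sketch of the per-file version check behind Version Based Recovery: each object carries a version number, every logged update records the version it saw, and replay applies an update only if the object is still at that version, so a gap in one file's update stream does not block replay for unrelated files. The structure is illustrative, not the Lustre recovery code.

```python
class ObjectStore:
    def __init__(self):
        self.versions = {}                     # object id -> current version

    def replay(self, op):
        cur = self.versions.get(op["obj"], 0)
        if cur != op["pre_version"]:
            return "gap"                       # dependent update lost; only this stream fails
        self.versions[op["obj"]] = cur + 1     # apply the update, advance the version
        return "replayed"

store = ObjectStore()
ops = [{"obj": "fileA", "pre_version": 0},     # independent stream, replays fine
       {"obj": "fileB", "pre_version": 3}]     # depends on an update that was lost
print([store.replay(op) for op in ops])        # ['replayed', 'gap']
```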

Imperative Recovery

Server-driven notification of failover
> Server notifies clients when failover has completed
> Clients reply immediately to the server
> Avoids clients waiting on RPC timeouts
> Avoids the server waiting for dead clients

Can tell the difference between a slow and a dead server
> No waiting for an RPC timeout to start recovery
> Can use external or internal notification

Performance Enhancements
SMP Scalability
Network Request Scheduler
Channel Bonding

SMP Scalability

Future nodes will have 100s of cores
> Need excellent SMP scaling on client and server
> Need to handle NUMA imbalances

Remove contention on servers
> Per-CPU resources (queues, locks)
> Fine-grained locking
> Avoid cross-node memory access
> Bind requests to a specific CPU deterministically, by client NID, object ID or parent directory (sketched below)

Remove contention on clients
> Parallel copy_{to,from}_user, checksums
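A small sketch of the deterministic request binding: hashing the client NID together with the object (or parent directory) identifier picks a CPU partition, so requests for the same resources are always handled on the same CPU and cross-CPU lock and cache traffic is avoided. The CRC-32 hash and partition count are illustrative assumptions.

```python
import zlib

NCPU_PARTITIONS = 8   # e.g. one request queue and lock set per CPU/NUMA node

def cpu_partition(client_nid: str, object_id: int) -> int:
    """Deterministically map a request to a CPU partition, so all requests
    for the same client/object land on the same per-CPU queue and locks."""
    key = f"{client_nid}:{object_id:#x}".encode()
    return zlib.crc32(key) % NCPU_PARTITIONS

print(cpu_partition("192.168.1.17@o2ib", 0x200000401))
print(cpu_partition("192.168.1.17@o2ib", 0x200000401))   # same partition every time
```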

Network Request Scheduler

Much larger working set than a disk elevator
Higher-level information
> Client NID, file/offset

Read and write queue for each object on the server (see the sketch after the diagram below)
> Requests sorted in object queues by offset
> Queues serviced round-robin, by operation count (variable)
> Deadline for request service time
> Scheduling input: op count, offset, fairness, delay

Future enhancements
> Job ID, process rank
> Gang scheduling across servers
> Quality of service
> Per UID/GID, per cluster: min/max bandwidth

[Figure: Network Request Scheduler queues - incoming requests leave the global FIFO queue Q and are sorted into per-data-object queues q1..qn, from which the work queue WQ is serviced]
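A toy version of the scheduler described above and in the diagram: requests move from the global FIFO into per-object queues kept in offset order, objects are serviced round-robin, and a deadline pulls overdue requests forward. The data structures and deadline policy are illustrative, not the actual Lustre NRS implementation.

```python
import heapq, itertools, time
from collections import defaultdict, deque

class NRS:
    def __init__(self, deadline=10.0):
        self.queues = defaultdict(list)   # object id -> offset-ordered heap
        self.rr = deque()                 # round-robin order over objects
        self.deadline = deadline          # seconds a request may wait
        self.seq = itertools.count()      # tie-breaker for equal offsets

    def enqueue(self, obj, offset, req):
        if obj not in self.rr:
            self.rr.append(obj)
        heapq.heappush(self.queues[obj],
                       (offset, next(self.seq), time.monotonic(), req))

    def dequeue(self):
        now = time.monotonic()
        # Overdue requests are served first, regardless of round-robin order.
        for obj in self.rr:
            q = self.queues[obj]
            if q and now - q[0][2] > self.deadline:
                return heapq.heappop(q)[3]
        # Otherwise: round-robin across objects, lowest offset first.
        while self.rr:
            obj = self.rr[0]
            self.rr.rotate(-1)
            if self.queues[obj]:
                return heapq.heappop(self.queues[obj])[3]
            self.rr.remove(obj)           # drop objects with empty queues
        return None

nrs = NRS()
nrs.enqueue("obj1", 4 << 20, "write obj1 @ 4MB")
nrs.enqueue("obj2", 0,       "read  obj2 @ 0")
nrs.enqueue("obj1", 0,       "write obj1 @ 0")
print(nrs.dequeue())   # write obj1 @ 0   (lowest offset in obj1's queue)
print(nrs.dequeue())   # read  obj2 @ 0   (round-robin moves to obj2)
```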

[Figure: HPC Center of the Future network diagram (repeated) - all client systems share the 10 TB/sec storage network with user data on 1000's of OSTs, metadata on 100's of MDTs and the HPSS archive]

Channel Bonding

Combine multiple network interfaces

Increased performance
> Shared: balance load across all interfaces

Improved reliability
> Failover: use backup links if the primary is down

Flexible configuration
> Network interfaces of different types/speeds
> Peers do not need to share all networks
> Configuration server per network

Conclusion

Lustre can meet the HPCS filesystem goals
> The scalability roadmap is reasonable
> Incremental, orthogonal improvements

HPCS provides impetus to grow Lustre
> Accelerates the development roadmap
> Ensures that Lustre will meet future needs

Questions?

The HPCS Filesystem Overview is online at:
http://wiki.lustre.org/index.php/learn:lustre_publications

THANK YOU
Andreas Dilger
Senior Staff Engineer, Lustre Group
Sun Microsystems

T10-DIF vs. Hash Tree

T10-DIF
CRC-16 Guard Word
> All 1-bit errors
> All adjacent 2-bit errors
> Single 16-bit burst error
> 10^-5 bit error rate
32-bit Reference Tag
> Misplaced writes detected, unless misplaced by exactly 2^n TB
> Misplaced reads detected, unless misplaced by exactly 2^n TB

Hash Tree
Fletcher-4 Checksum
> All 1-, 2-, 3- and 4-bit errors
> All errors affecting 4 or fewer 32-bit words
> Single 128-bit burst error
> 10^-13 bit error rate
Hash Tree
> Misplaced read
> Misplaced write
> Phantom write
> Bad RAID reconstruction

Metadata Improvements

Metadata Writeback Cache
> Avoids unnecessary server communication
> Operations are logged/cached locally
> Performance of a local file system when uncontended

Aggregated distributed operations (sketched below)
> Server updates batched and transferred using bulk protocols (RDMA)
> Reduced network and service overhead

Sub-Tree Locking
> Lock aggregation: a single lock protects a whole subtree
> Reduces lock traffic and server load
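A hedged sketch of the aggregation idea: metadata operations are logged locally for low latency and flushed to the server as one batched bulk transfer instead of one round trip per operation. The class and the flush policy are illustrative, not the Lustre writeback-cache protocol.

```python
class WritebackMDCache:
    """Log metadata ops locally, then flush them to the server in batches."""
    def __init__(self, send_batch, max_ops=1024):
        self.send_batch = send_batch       # callable doing one bulk RPC/RDMA
        self.max_ops = max_ops
        self.log = []

    def submit(self, op):                  # e.g. ("create", "/dir/f001", mode)
        self.log.append(op)                # local latency: no server round trip
        if len(self.log) >= self.max_ops:
            self.flush()

    def flush(self):
        if self.log:
            self.send_batch(self.log)      # single bulk transfer of many ops
            self.log = []

sent = []
cache = WritebackMDCache(send_batch=lambda ops: sent.append(len(ops)), max_ops=4)
for i in range(10):
    cache.submit(("create", f"/scratch/run/f{i:04d}", 0o644))
cache.flush()
print(sent)   # [4, 4, 2]: ten creates sent in three batches instead of ten RPCs
```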

Metadata Improvements

Metadata Protocol
Size on MDT (SOM)
> Avoid multiple RPCs for attributes derived from the OSTs
> OSTs remain definitive while the file is open
> Compute on close and cache on the MDT

Readdir+
> Aggregation of directory I/O, getattrs and locking

[Figure: LNET SMP server scaling - RPC throughput against total client processes and client nodes, comparing finer-grained locking with a single lock]

Communication Improvements

Flat communications model
> Stateful client/server connection required for coherence and performance
> Every client connects to every server
> O(n) lock conflict resolution

Hierarchical communications model
> Aggregate connections, locking, I/O, metadata ops
> Caching clients (Lustre - system calls)
> Aggregate local processes (cores)
> I/O forwarders scale another 32x or more
> Caching proxies (Lustre - Lustre)
> Aggregate whole clusters
> Implicit broadcast - scalable conflict resolution

Fault Detection Today

RPC timeout
> A timeout cannot distinguish death from congestion

Pinger
> No aggregation across clients or servers
> O(n) ping overhead

Routed networks
> Router failure confused with peer failure

Fully automatic failover scales with the slowest time constant
> 10s of minutes on large clusters
> Finer failover control could be much faster

Architectural Improvements

Scalable Health Network
> Burden of monitoring clients is distributed, not replicated
> Fault-tolerant status reduction/broadcast network
> Servers and LNET routers
> LNET high-priority small message support, so the health network stays responsive

Prompt, reliable detection
> Time constants in seconds
> Failed servers, clients and routers
> Recovering servers and routers

Interface with existing RAS infrastructure
> Receive and deliver status notifications

[Figure: Health monitoring network - a primary health monitor and a failover health monitor connected to the clients]

Operations Support

Lustre HSM
> Interface from Lustre to hierarchical storage
> Initially HPSS
> SAM/QFS soon afterward

Tiered storage
> Combine HSM support with ZFS's SSD support and a policy manager to provide tiered storage management