Lustre. Paul Bienkowski, 2bienkow@informatik.uni-hamburg.de. Proseminar Ein-/Ausgabe - Stand der Wissenschaft (Input/Output - State of the Art), 2013-06-03.

Outline: 1 Introduction; 2 The Project (Goals and Priorities, History, Who is involved?); 3 Lustre Architecture (Network Architecture, Data Storage and Access, Software Architecture); 4 Performance (Theoretical Limits, Recent Improvements); 5 Conclusion; 6 References.

What is Lustre? A parallel filesystem; well-scaling (in capacity and speed); based on the Linux kernel; optimized for clusters (many clients). (Figure: a Linux cluster.)

The Project: Goals and Priorities, History, Who is involved?

Goals and Priorities. Goals: performance, features, stability; their relative priority has shifted over time. Until 2007 Lustre was a science project (prototype); since 2010 it has been used in high-performance production environments. Reproduced from [2].

History: Started as a research project in 1999 by Peter Braam; Braam founds Cluster File Systems; Lustre 1.0 released in 2003; Sun Microsystems acquires Cluster File Systems in 2007; Oracle Corporation acquires Sun Microsystems in 2010; Oracle ceases Lustre development and many new organizations continue development, including Xyratex, Whamcloud, and more; in 2012, Intel acquires Whamcloud; in 2013, Xyratex purchases the original Lustre trademark from Oracle.

Who is involved? Oracle: no development, only pre-1.8 support. Intel: funding, preparing for exascale computing. Xyratex: hardware bundling. OpenSFS (Open Scalable File Systems): keeping Lustre open. EOFS (European Open File Systems): community collaboration. FOSS community: many joined one of the above to help development (e.g. Braam now works for Xyratex). DDN, Dell, NetApp, Terascala, Xyratex: storage hardware bundled with Lustre.

Who is involved? Supercomputers: "Lustre File System is managing data on more than 50 percent of the top 50 supercomputers and seven of the top 10 supercomputers." (hpcwire.com, 2008 [9]) The biggest computer today (Titan by Cray, #1 on the TOP500) uses Lustre. [13]

Lustre Architecture: Network Architecture, Data Storage and Access, Software Architecture.

Network Architecture: Network Structure (diagram: CLIENTS, METADATA with MDS, OBJECT STORAGE; different network types: any TCP/GigE, InfiniBand, Cray SeaStar, Myrinet MX, RapidArray (ra), Quadrics Elan [5]; graph reproduced from [1]).

Network Architecture: Metadata Server (MDS). Stores file information (metadata); queried by clients in order to access files; manages data storage; at least one is required; multiple MDSs are possible (different techniques); recent focus for performance improvement.

Network Architecture: Network Structure (diagram, continued: CLIENTS; METADATA with MDS; OBJECT STORAGE with OSS; network types as above [5]; graph reproduced from [1]).

Network Architecture: Object Storage Server (OSS). Stores file content (objects); accessed by clients directly; at least one is required; > 10,000 OSSs are used in large-scale computers.

Network Architecture: Network Structure (diagram, continued: CLIENTS; METADATA with MDS and MDT; OBJECT STORAGE with OSS and OST; network types as above [5]; graph reproduced from [1]).

Network Architecture: Targets. Two types: object storage target (OST) and metadata target (MDT); can be any block device (normal hard disk / flash drive / SSD, advanced storage arrays); will be formatted for Lustre; up to 16 TiB per target (ext4 limit).

Network Architecture: Failover. If one server fails, another one takes over; the backup server needs access to the targets; this enables on-line software upgrades (one server at a time).

Network Architecture: System characteristics (table reproduced from [1])
Clients: typically 1-100,000 systems; performance 1 GB/s I/O, 1,000 metadata ops.
Object Storage: typically 1-1,000 OSSs; performance 500 MB/s - 2.5 GB/s; required attached storage: total capacity / OSS count; desirable hardware: good bus bandwidth.
Metadata Storage: 1 + backup (up to 100 with Lustre > 2.4); performance 3,000-15,000 metadata ops; required attached storage: 1-2% of file system capacity; desirable hardware: adequate CPU power, plenty of memory.

Data Storage and Access: Traditional Inodes. Used in many file system structures (e.g. ext3); each inode has an index; bijective mapping (file <-> inode); contains metadata and the data location (pointer).

Data Storage and Access: Metadata (Lustre Inodes). Lustre uses a similar structure; inodes are stored on the MDT; inodes point to objects on OSTs; a file is striped across multiple OSTs; the inode stores references to these OSTs.
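
To make the metadata/data split concrete, here is a minimal Python sketch of the idea: a record kept on the MDT that holds the usual attributes plus a list of references to objects on OSTs. The names (FileLayout, OstObjectRef) and fields are illustrative assumptions, not Lustre's actual inode or layout format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OstObjectRef:
    """Reference to one data object stored on an Object Storage Target."""
    ost_index: int   # which OST holds the object
    object_id: int   # identifier of the object on that OST

@dataclass
class FileLayout:
    """Conceptual MDT-side record: metadata plus pointers to OST objects.

    Illustrative only; real Lustre inode/layout formats differ.
    """
    owner: str
    mode: int
    size: int
    stripe_size: int                              # bytes per stripe block
    objects: List[OstObjectRef] = field(default_factory=list)

# Example: a 48 MiB file striped over three OSTs with 1 MiB stripes.
layout = FileLayout(owner="alice", mode=0o644, size=48 * 2**20,
                    stripe_size=1 * 2**20,
                    objects=[OstObjectRef(0, 101), OstObjectRef(1, 202),
                             OstObjectRef(2, 303)])
print(f"{len(layout.objects)}-way striped, stripe size {layout.stripe_size} B")
```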

Data Storage and Access: Striping. RAID-0 style striping; data is split into blocks; the block size is adjustable per file/directory; each OST stores every n-th block (with n being the number of OSTs involved); speed advantage (multiple simultaneous OSS/OST connections); capacity advantage (a file can be bigger than a single OST). (Diagram: blocks A1-A8 of file A and B1-B4 of file B distributed round-robin across OST1, OST2, OST3.)
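
The round-robin placement is plain arithmetic; the sketch below (an illustration under the assumptions above, not Lustre code) maps a logical file offset to the OST slot and the offset inside that OST's object for RAID-0-style striping.

```python
def locate(offset: int, stripe_size: int, stripe_count: int):
    """Map a logical file offset to (ost_slot, object_offset) under RAID-0 striping.

    ost_slot is the position in the file's list of OSTs (0..stripe_count-1);
    object_offset is the byte offset inside that OST's object.
    """
    stripe_index = offset // stripe_size          # which stripe block of the file
    ost_slot = stripe_index % stripe_count        # round-robin over the OSTs
    stripe_round = stripe_index // stripe_count   # full rounds completed so far
    object_offset = stripe_round * stripe_size + offset % stripe_size
    return ost_slot, object_offset

# 1 MiB stripes over 3 OSTs: byte offset 5.5 MiB falls into the file's 6th stripe block.
print(locate(offset=5 * 2**20 + 512 * 2**10, stripe_size=2**20, stripe_count=3))
# -> (2, 1572864): OST slot 2, at 1.5 MiB inside that OST's object
```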

Data Storage and Access: Data Safety & Integrity. Data safety: striping does not back up any data, but a RAID can be used for the targets; in target RAIDs a drive may fail (depends on the RAID type). Availability: failover ensures target reachability; multiple network types/connections. Consistency: the Lustre log (similar to a journal); simultaneous-write protection via the LDLM (Lustre Distributed Lock Manager), distributed across the OSSs.
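
As a rough sketch of the idea behind extent-based write protection (the general concept only, not Lustre's actual LDLM implementation): two byte-range locks conflict exactly when their extents overlap and at least one of them is a write lock.

```python
from typing import NamedTuple

class ExtentLock(NamedTuple):
    start: int      # first byte covered by the lock
    end: int        # last byte covered by the lock (inclusive)
    write: bool     # True for a write (exclusive) lock, False for read (shared)

def conflicts(a: ExtentLock, b: ExtentLock) -> bool:
    """Two extent locks conflict if their ranges overlap and one is a write."""
    overlap = a.start <= b.end and b.start <= a.end
    return overlap and (a.write or b.write)

# Two readers of overlapping ranges coexist; a writer excludes an overlapping reader.
r1 = ExtentLock(0, 4095, write=False)
r2 = ExtentLock(1024, 8191, write=False)
w1 = ExtentLock(2048, 3071, write=True)
print(conflicts(r1, r2))  # False: shared reads
print(conflicts(r1, w1))  # True: write overlaps a read
```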

Software Architecture - Server. The MDS/OSS has mkfs.lustre-formatted space; the ldiskfs kernel module is required (based on ext4); the kernel requires patching (only available for some Enterprise Linux 2.6 kernels, e.g. Red Hat). Limitations: very platform-dependent; needs a compatible kernel; not a problem when using an independent storage solution.

Software Architecture - Client. Patch-free client: kernel module for Linux 2.6; userspace library (liblustre); userspace filesystem (FUSE) drivers; NFS access (legacy support). Platform support: most Linux kernel versions > 2.6 supported; Windows via NFS; MacOS via NFS/FUSE.

Software Architecture: Interversion Compatibility. Lustre usually supports interoperability between versions [6], e.g. 1.8 clients with 2.0 servers and vice versa; on-line upgradeability using failover systems.

Performance: Theoretical Limits, Recent Improvements.

Theoretical Limits. "A well designed Lustre storage system can achieve 90% of underlying hardware bandwidth." (Zhiqi Tao, Sr. System Engineer, Intel [3]) Example: 160 OSS with 16 OSTs each, 2 TiB each; 2.5 PiB (pebibytes) total storage; each OST delivers 50 MiB/s; 800 MiB/s combined throughput per OSS; stripe size 16 MiB; writing a 200 GiB file (80 stripes per OSS, 5 stripes per OST) means 1.25 GiB per OSS, written in 1.6 seconds; with all OSS working in parallel, the total speed is 125 GiB/s.
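
The slide's back-of-the-envelope numbers can be reproduced directly; the following snippet only replays that arithmetic with the figures given above.

```python
# Replay the slide's example: 160 OSS, 16 OSTs per OSS, 16 MiB stripes,
# 50 MiB/s per OST, one 200 GiB file striped over every OST.
MiB = 2**20
GiB = 2**30

n_oss = 160
osts_per_oss = 16
ost_bw = 50 * MiB            # bytes/s per OST
stripe_size = 16 * MiB
file_size = 200 * GiB

oss_bw = osts_per_oss * ost_bw                       # combined throughput per OSS
stripes = file_size // stripe_size                   # total stripe blocks in the file
stripes_per_oss = stripes // n_oss
stripes_per_ost = stripes // (n_oss * osts_per_oss)
bytes_per_oss = file_size // n_oss
write_time = bytes_per_oss / oss_bw                  # all OSS work in parallel
total_bw = file_size / write_time

print(f"per-OSS throughput: {oss_bw / MiB:.0f} MiB/s")               # 800 MiB/s
print(f"stripes per OSS / per OST: {stripes_per_oss} / {stripes_per_ost}")  # 80 / 5
print(f"data per OSS: {bytes_per_oss / GiB:.2f} GiB, "
      f"written in {write_time:.1f} s")                              # 1.25 GiB, 1.6 s
print(f"aggregate bandwidth: {total_bw / GiB:.0f} GiB/s")            # 125 GiB/s
```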

Recent Improvements. Wide striping: the OST-per-file limit extended, > 160 OSTs possible, via inode xattrs. ZFS support instead of ldiskfs on the targets: better kernel support, more widely used, better developed, all advantages of ZFS (checksums, up to 256 ZiB per OST, compression, copy-on-write) [12]. Multiple MDSs: metadata striping / namespacing, since metadata performance is a bottleneck. (Binary prefixes: kibi, mebi, gibi, tebi, pebi, exbi, zebi, yobi.)

Recent Improvements: Metadata overhead. Common task: readdir (directory traversal) and stat (file information), e.g. 'ls -l'. Problem: one stat call for every file, each of which is an RPC (POSIX); each RPC generates overhead and I/O wait. Solution: Lustre detects the readdir+stat pattern and requests all stats from the OSSs in advance (in parallel); a combined RPC reply is sent (up to 1 MB).
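
A toy latency model (illustrative assumptions and made-up timings, not measurements) shows why one synchronous stat RPC per directory entry dominates 'ls -l' time, and why prefetching attributes with combined replies helps.

```python
def ls_l_time(entries: int, rtt: float, server_time: float,
              batched: bool, replies_per_rpc: int = 1000) -> float:
    """Crude model of 'ls -l' over a network filesystem.

    rtt: round-trip network latency per RPC (seconds)
    server_time: server-side work per entry (seconds)
    batched: if True, attributes are prefetched and returned in combined replies
    """
    if batched:
        rpcs = -(-entries // replies_per_rpc)          # ceiling division
        return rpcs * rtt + entries * server_time      # one network wait per batch
    return entries * (rtt + server_time)               # one synchronous RPC per entry

n = 100_000
print(f"per-entry RPCs:  {ls_l_time(n, rtt=200e-6, server_time=20e-6, batched=False):.1f} s")
print(f"batched replies: {ls_l_time(n, rtt=200e-6, server_time=20e-6, batched=True):.1f} s")
# roughly 22 s vs 2 s for 100k entries with a 200 us round trip (illustrative only)
```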

Recent Improvements: Metadata overhead (cont'd). (Graphs: traversal time in seconds, log scale, versus item count from 100K to 1.6M, for 'ls -l' with subdir items and with 4-striped files, each comparing the improved against the original implementation; graph data from [4].)

Recent Improvements: Metadata overhead (cont'd). Alternative solution: readdirplus from the POSIX HPC I/O Extensions [11].

Recent Improvements: SSDs as MDT. Metadata is often the bottleneck; SSDs have higher throughput; SSDs achieve far more IOPS (important for metadata); only a small capacity is required (which matters, since SSDs are expensive). The following graphs plot metadata access rates (create, stat, unlink); 8 processes per client node; HDD/SSD/RAM; ldiskfs / ZFS (Orion Lustre branch); data from [10].

Recent Improvements: SSDs as MDT. (Graph: creates per second, 0 to about 20,000, versus client nodes, single node then 1 to 16, for ldiskfs-HDD, ldiskfs-RAM, ldiskfs-SSD, and ZFS-SSD.)

Recent Improvements: SSDs as MDT. (Graph: stats per second, 10,000 to about 50,000, versus client nodes, single node then 1 to 16, for the same four configurations.)

Recent Improvements: SSDs as MDT. (Graph: unlinks per second, 0 to about 20,000, versus client nodes, single node then 1 to 16, for the same four configurations.)

Conclusion. Lustre is still heavily developed; many interested and involved companies plus funding; actively used in HPC clusters; scales well; throughput depends on the network; improvements for metadata performance and ZFS are still required; Linux 2.6 only (Red Hat Enterprise Linux, CentOS).

References
[1] http://www.raidinc.com/assets/documents/lustrefilesystem_wp.pdf (accessed 2013-05-17)
[2] http://www.opensfs.org/wp-content/uploads/2011/11/rock-hard1.pdf (accessed 2013-05-17)
[3] http://www.hpcadvisorycouncil.com/events/2013/switzerland-workshop/presentations/day_3/10_intel.pdf (accessed 2013-05-21)
[4] http://storageconference.org/2012/presentations/t01.dilger.pdf (accessed 2013-05-21)
[5] http://wiki.lustre.org/images/3/35/821-2076-10.pdf (accessed 2013-05-28)
[6] http://wiki.lustre.org/index.php/lustre_interoperability_-_upgrading_from_1.8_to_2.0 (accessed 2013-05-25)
[7] http://wiki.lustre.org/index.php/faq_-_installation (accessed 2013-05-12)
[8] https://wiki.hpdd.intel.com/display/pub/why+use+lustre (accessed 2013-05-21)
[9] http://www.hpcwire.com/hpcwire/2008-11-18/suns_lustre_file_system_powers_top_supercomputers.html (accessed 2013-05-28)
[10] http://www.isc-events.com/isc13_ap/presentationdetails.php?t=contribution&o=2119&a=select&ra=sessiondetails (accessed 2013-05-31)
[11] http://www.pdl.cmu.edu/posix/docs/posix-io-sc05-asc.ppt (accessed 2013-05-31)
[12] http://zfsonlinux.org/lustre.html (accessed 2013-05-31)
[13] http://www.olcf.ornl.gov/wp-content/themes/olcf/titan/images/gallery/titan1.jpg (accessed 2013-06-02)