Andreas Dilger, Intel High Performance Data Division
Lustre User Group 2017

Statements regarding future functionality are estimates only and are subject to change without notice.

Performance and Feature Improvement Highlights

2.10 is nearly here and has several long-awaited features:
- Multi-Rail LNet for improved performance and reliability, continuing in 2.11
- Progressive File Layout (PFL) for improved performance and usability (example after this list)
- ZFS-related improvements (snapshots, create performance, ...), continuing in 2.11

2.11 is looking to be a very interesting release:
- Data-on-MDT for improved small-file performance/latency
- File Level Redundancy (FLR Phase 1, delayed resync) for reliability/performance
- DNE directory restriping for ease of space balancing and DNE2 adoption

2.12/2.13 plans for continued functional and performance improvements:
- File Level Redundancy continues (FLR Phase 2, immediate resync)
- DNE directory auto-split to improve usability and performance of DNE2
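For orientation, PFL composite layouts are set with the existing lfs setstripe interface; the sketch below is illustrative only (the directory path, component sizes, and stripe counts are example values, not recommendations).

    # PFL composite layout (Lustre 2.10+): 1 stripe up to 1 MB, 4 stripes up to
    # 1 GB, then stripe across all available OSTs to end of file
    lfs setstripe -E 1M -c 1 -E 1G -c 4 -E -1 -c -1 /mnt/lustre/pfl_dir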

Lustre Performance, Roughly How It Looks Today

[Architecture diagram: Management Target (MGT) and Metadata Targets (MDTs) on Metadata Servers (~10s); Object Storage Targets (OSTs) on Object Storage Servers (~1000s); IML on the management network; Lustre clients (~50,000).]
- MDTs: ZFS HDD ~10k creates/s per MDS; ldiskfs SSD ~50k creates/s per MDS
- OSTs: HDD/SSD 8 GB/s per OSS; SSD OSS 50k 4-32KB IOPS
- High-performance data network (OmniPath, FDR IB, 10GbE): 8 GB/s

Performance Targets for Lustre (2.11+)

[Architecture diagram: same layout as above, with Management Target (MGT), Metadata Targets (MDTs) on Metadata Servers (~10s), Object Storage Targets (OSTs) on Object Storage Servers (~1000s), IML on the management network, and Lustre clients (~50,000).]
- MDTs: ZFS SSD 100k creates/s per MDS; SSD MDS 50k 4-32KB small-file ops per MDS
- OSTs: HDD/SSD 10-12 GB/s per OSS; SSD OSS 150k 4-32KB IOPS
- High-performance data network (OmniPath, EDR IB, 40GbE): 20 GB/s+ with Multi-Rail

ZFS Enhancements Affecting Lustre (2.11)

Lustre osd-zfs improvements to use ZFS more efficiently:
- Improved file create performance (Intel)
- Update to the ZFS 0.7.0 final release, if not already in 2.10

Changes to core ZFS code (0.7.0+):
- QAT hardware-assisted checksums and compression (Intel)
- Improved kernel memory allocation for IO buffers (ABD) (others, Intel)
- Improved fault handling and notification (FMA/ZED/degraded mode) (Intel)
- Multi-mount protection (MMP) for improved HA safety (LLNL) (see the example after this list)
- Project quota accounting for ZFS (Intel)
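As a hedged illustration of the MMP item above: once the ZFS release underneath Lustre includes MMP, it is enabled per pool with the multihost property, and each server needs a stable hostid so a foreign import can be detected. The pool name below is illustrative.

    # Enable multi-mount protection on a ZFS pool backing an MDT/OST
    # (requires a ZFS release with MMP support; pool name is illustrative)
    zgenhostid                      # ensure /etc/hostid exists on each server
    zpool set multihost=on mdt0pool
    zpool get multihost mdt0pool    # verify the property took effect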

New Features for ZFS (Intel, ANL; ZFS 0.8+)

Declustered parity RAID (dRAID) and distributed hot spares:
- Resilver across all drives in the zpool, potentially improving performance by O(num_vdevs)
- Use the bandwidth/IOPS of "hot spare" drives in normal operation rather than leaving them idle, or trade off the extra capacity of large drives for a shorter risk window when a drive has failed
- https://github.com/zfsonlinux/zfs/wiki/draid-howto

Metadata allocation class (see the sketch after this list):
- Isolate small and large block sizes/lifetimes
- Store all metadata on dedicated SSD/NVRAM, or store all metadata in separate metaslabs
- https://github.com/zfsonlinux/zfs/issues/3779
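A sketch of the metadata allocation class as it was later exposed in ZFS 0.8+ as a "special" vdev; the prototype referenced in the issue above may use different syntax, and the pool and device names here are illustrative.

    # Add a mirrored "special" allocation class vdev so pool metadata (and,
    # optionally, small blocks) land on NVMe instead of the HDD data vdevs
    zpool add ostpool special mirror nvme0n1 nvme1n1
    zfs set special_small_blocks=64K ostpool   # also route <=64K blocks to the special vdev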

LNet Dynamic Discovery, Network Health (2.11/2.12)

Builds on LNet Multi-Rail in Lustre 2.10 (Intel, SGI/HPE*)

LNet Dynamic Discovery (LU-9480):
- Automatically configure peers that share multiple LNet networks
- Avoids the need for the admin to specify the Multi-Rail configuration for each node

LNet Network Health (LU-9120):
- Detect network interface and router faults
- Automatically handle fault conditions
- Restore the connection when it becomes available
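For context, the Multi-Rail configuration that Dynamic Discovery is meant to automate is currently entered by hand with lnetctl; the interface names and NIDs below are illustrative.

    # Bring up two local interfaces on the same LNet network (Multi-Rail, 2.10+)
    lnetctl net add --net o2ib0 --if ib0,ib1

    # Declare a Multi-Rail peer explicitly; Dynamic Discovery (LU-9480) is
    # intended to populate this peer information automatically
    lnetctl peer add --prim_nid 10.0.0.2@o2ib0 --nid 10.0.1.2@o2ib0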

Data-on-MDT Small File Performance (Intel, 2.11)
- Avoid OST overhead (data, lock RPCs)
- High-IOPS MDTs (mirrored SSD vs. RAID-6 HDD)
- Avoid contention with streaming IO to OSTs
- Prefetch file data with metadata
- Size-on-MDT for small files
- Integrates with PFL to simplify usage: start the file on the MDT, grow onto OSTs if it gets larger
- Complementary with DNE2 striped directories: scale small-file IOPS with multiple MDTs
- https://jira.hpdd.intel.com/browse/lu-3285

[Diagram: the client sends open, write, read, attr, layout, lock, size, and read-data requests for small files directly to the MDS. Example DoM/PFL file layout: MDT component for [0, 1MB), 4 OST stripes for [1MB, 1GB), 60 OST stripes for [1GB, EOF).]
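A minimal sketch of how the example DoM/PFL layout in the diagram could be expressed, assuming the "-L mdt" component type proposed for Data-on-MDT in 2.11; the path and component sizes are illustrative.

    # Default layout for a directory: first 1 MB of each file on the MDT,
    # then 4 OST stripes up to 1 GB, then 60 OST stripes to end of file
    lfs setstripe -E 1M -L mdt -E 1G -c 4 -E -1 -c 60 /mnt/lustre/small_files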

File Level Redundancy (Intel, 2.11/2.12)
- Based on the Progressive File Layout (PFL) feature in Lustre 2.10 (Intel, ORNL)
- Significant value and functionality added for HPC and other environments
- Optionally set on a per-file/directory basis (e.g. mirror input files and one daily checkpoint)
- Higher availability for server/network failure, finally better than HA failover
- Robustness against data loss/corruption: mirroring (and later M+N erasure coding)
- Increased read speed for widely shared input files: N-way mirror over many OSTs
- Mirror/migrate files over multiple storage classes: NVRAM -> SSD -> HDD (e.g. burst buffer)
- Local vs. remote replicas (WAN)
- Partial HSM file restore
- File versioning (no resync of the replica)
- Many more possibilities...

[Diagram: Replica 0, object j (primary, preferred); Replica 1, object k (stale); delayed resync.]
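A sketch of the user-visible interface proposed for FLR, matching the delayed-resync model in the phasing on the next slide; the lfs mirror subcommands are the proposed names, and the file paths and pool name are illustrative.

    # Create a new file with two mirrors (replicas)
    lfs mirror create -N2 /mnt/lustre/input.dat

    # Add a mirror to an existing file, e.g. on an OST pool named "flash"
    lfs mirror extend -N -p flash /mnt/lustre/checkpoint.dat

    # After writes have made one replica stale, resynchronize it (delayed resync)
    lfs mirror resync /mnt/lustre/checkpoint.dat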

FLR Phased Implementation Approach

Phases 2/3/4 can be implemented in any order:
- Phase 0: Composite layouts from the PFL project (Intel, ORNL, 2.10), plus OST pool inheritance and project/pool quotas
- Phase 1: Delayed read-only mirroring, depends on Phase 0 (Intel, 2.11): mirror and migrate data after the file is written, using an external resync tool
- Phase 2: Immediate write mirroring, depends on Phase 1 (Intel, 2.12): clients write to multiple mirrors directly
- Phase 3: Integration with policy engine/copytool, depends on Phase 1: automated migration between tiers using policy/space, parallel scan/resync
- Phase 4: Erasure coding for striped files, depends on Phase 1 (2.13): avoid the 2x or 3x overhead of mirroring for large files

Ongoing DNE Improvements (Intel, 2.11/2.12)

Directory migration from single to striped/sharded directories (LU-4684):
- Rebalance space usage, improve large-directory performance
- Inodes are migrated along with the directory entries

Automatic directory restriping to reduce/avoid the need for explicit striping at create time (manual equivalents are sketched after this list):
- Start with a single-stripe directory for low overhead in the common case
- Add extra shards when the master directory grows large enough (e.g. 64k entries)
- Existing entries stay in the master, or are migrated to shards asynchronously?
- New entries and inodes are created in the new shards/MDTs
- Performance scales as the directory grows
- MDT pools for space/class management?

[Diagram: master directory, then +4 directory shards, then +12 directory shards.]
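For reference, the manual DNE operations that restriping and auto-split are meant to reduce look roughly like this today; the MDT indices and stripe counts are illustrative.

    # Create a directory striped across two MDTs at create time (DNE2)
    lfs setdirstripe -c 2 -i 0 /mnt/lustre/bigdir

    # Migrate an existing directory (entries and their inodes) to MDT0001 (LU-4684)
    lfs migrate -m 1 /mnt/lustre/olddir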

Miscellaneous Improvements (2.11)
- Client Lock Ahead (Cray*): allow the client (MPI-IO) to request read/write DLM locks before IO to avoid latency/contention
- Network Request Scheduler Token Bucket Filter (NRS-TBF) (DDN*, Uni Mainz): improve the flexibility of specifying TBF rules on the MDS and OSS; global scheduling (2.12?) (see the example after this list)
- Kernel tracepoints for logging/debugging/performance analysis (ORNL)
- Small-file write investigation and optimizations (Intel): reduce client (and RPC/server?) overhead for small (~8KB) writes
- ldiskfs scalability improvements (Seagate*): support for >256TB OSTs, >10M directory entries; more than 4B files per filesystem?
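As a rough illustration of the TBF rules whose specification is being made more flexible: the commands below follow the NRS-TBF syntax used by recent releases, but the exact rule grammar differs between Lustre versions, and the rule name, NID range, and rate are illustrative.

    # Enable the TBF policy on the OSS bulk IO service, keyed by client NID,
    # then limit a set of login nodes to 100 RPCs per second
    lctl set_param ost.OSS.ost_io.nrs_policies="tbf nid"
    lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start login_nodes nid={10.0.0.[1-8]@o2ib} rate=100"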

Advanced Lustre* Research: Intel Parallel Computing Centers
- Uni Hamburg + German Climate Computing Centre (DKRZ): adaptive optimized ZFS data compression; client-side data compression integrated with ZFS
- GSI Helmholtz Centre for Heavy Ion Research: TSM* HSM copytool for Lustre
- University of California Santa Cruz: automated client-side load balancing
- Johannes Gutenberg University Mainz: global adaptive Network Request Scheduler; allocate bandwidth to jobs at startup for Quality of Service
- Lawrence Berkeley National Laboratory: Spark and Hadoop on Lustre

Lustre Client-side Data Compression (2.12?)

IPCC project: Uni Hamburg, client-side data compression
- Compresses data in 32KB chunks, allowing sub-block read/write
- Integrated with ZFS data blocks; code and disk-format changes needed to avoid de-/re-compressing data
- Good performance/space benefits

[Chart: performance estimate with compression implemented inside IOR.]

Lustre Client-side Compression + Hardware Assist
- Intel QuickAssist Technology (QAT): PCI card/chipset accelerator (see the note after this list)
- Compression improves performance
- QAT integrated with ZFS 0.7.0
- Benchmark shows local ZFS performance on BGI genomics data

[Chart: data from Beijing Genomics Institute; 2x Intel Xeon E5-2620 v3, QAT Adapter 8950.]
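A hedged note on how the QAT offload above is exercised: the ZFS 0.7 QAT integration accelerates gzip, so at the dataset level the only change is selecting a gzip compressor; the dataset name is illustrative.

    # With a QAT adapter and a QAT-enabled ZFS 0.7 build, gzip (de)compression
    # on this dataset is offloaded to the accelerator
    zfs set compression=gzip ostpool/ost0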

[Chart: BGI genomics data compression benchmark, 2x Intel Xeon E5-2620 v3, QAT Adapter 8950.]

Development Roadmap Continues: 2.13 and Beyond

Tiered storage with composite layouts and File Level Redundancy:
- Policy engine to manage migration over tiers, rebuild replicas, ChangeLogs
- Policies for pathname, user, extension, age, OST pool, mirror copies, ...?
- Multiple policy and scanning engines are being presented this week
- Lustre optimizations such as fast inode scanning could be implemented
- Internal OST maps on the MDT to allow fast rebuild of a failed OST without scanning?
- This is largely a userspace integration task, with some hooks into Lustre

Metadata redundancy via DNE2 distributed transactions:
- New striped-directory layout hash for mirrored directories, mirrored MDT inodes
- Use the same mechanism as DNE2 for consistency and recovery
- Early design work started, discussions ongoing

Multi-Tiered Storage and File Level Redundancy

Data locality, with direct access from clients to all storage tiers as needed.

[Architecture diagram: Management Target (MGT) and Metadata Targets (MDTs) on Metadata Servers (~10s), plus NVMe MDTs on the client network; SAS Object Storage Targets (OSTs) and SATA/SMR archive OSTs (erasure coded) on Object Storage Servers (~1000s); a policy engine; NVMe OSTs/LNet routers on the client network acting as a "burst buffer"; Lustre clients (~100,000+).]

Notices and Disclaimers

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks

© Intel Corporation. Intel, Intel Inside, the Intel logo, Xeon, Intel Xeon Phi, Intel Xeon Phi logos and Xeon logos are trademarks of Intel Corporation or its subsidiaries in the United States and/or other countries.