Andreas Dilger, Intel High Performance Data Division LAD 2017

Statements regarding future functionality are estimates only and are subject to change without notice.
* Other names and brands may be claimed as the property of others.

Upcoming Feature Highlights
2.11 landings in progress, with several features landed or underway:
- File DLM lockahead
- Data-on-MDT for improved small-file performance/latency
- File Level Redundancy (FLR Phase 1: delayed resync)
- DNE directory restriping for ease of space balancing and DNE2 adoption
2.12/2.13 plans: continued functional and performance improvements:
- File Level Redundancy continues (FLR Phase 2: immediate resync)
- DNE directory auto-split to improve usability and performance of DNE2

LNet Dynamic Discovery (LU-9480, 2.11) and LNet Network Health (LU-9120, 2.12)
Builds on LNet Multi-Rail in Lustre 2.10 (Intel, HPE/SGI*)
LNet Dynamic Discovery:
- Automatically configures peers that share multiple LNet networks
- Avoids the need for an admin to specify the Multi-Rail configuration for nodes (see the sketch below)
LNet Network Health:
- Detects network interface and router faults
- Handles LNet faults without Lustre recovery
- Restores connections when available
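For contrast, the sketch below shows the kind of manual Multi-Rail peer declaration (via lnetctl, here driven from Python) that Dynamic Discovery is meant to make unnecessary. The NIDs and exact lnetctl options are illustrative assumptions and should be verified against the lnetctl documentation for the release in use.

```python
# Hedged sketch: manually declaring a Multi-Rail peer with lnetctl, the kind
# of per-node configuration LNet Dynamic Discovery automates.  NIDs are
# illustrative; verify option names against 'lnetctl peer add --help'.
import subprocess

def add_multirail_peer(primary_nid, extra_nids):
    """Tell the local LNet about all interfaces of a remote peer."""
    subprocess.run(["lnetctl", "peer", "add",
                    "--prim_nid", primary_nid,
                    "--nid", ",".join(extra_nids)],
                   check=True)

if __name__ == "__main__":
    add_multirail_peer("192.168.1.10@o2ib",
                       ["192.168.2.10@o2ib1", "10.0.0.10@tcp"])
```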

Data-on-MDT Small File Performance (LU-3285, Intel, 2.11)
- Avoids OST overhead (data and lock RPCs)
- High-IOPS MDTs (mirrored SSD vs. RAID-6 HDD)
- Avoids contention with streaming I/O to OSTs
- Prefetches file data with metadata
- Size-on-MDT for small files
- Integrates with PFL to simplify usage: start the file on the MDT, grow onto OSTs if it becomes larger (see the sketch below)
- Complementary with DNE2 striped directories: scale small-file IOPS with multiple MDTs
https://jira.hpdd.intel.com/browse/lu-3285
[Diagram: client performs open, write, read, and attribute (layout, lock, size) operations, with small-file I/O going directly to the MDS]
Example DoM/PFL file layout: MDT for [0, 1MB), 4 OST stripes for [1MB, 1GB), 60 OST stripes for [1GB, EOF)
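As a rough illustration of the composite layout above, the sketch below (assuming a Lustre 2.11+ client with the lfs utility and an illustrative path /mnt/lustre/domdir) sets a DoM/PFL layout whose first 1MB component lives on the MDT. The exact option spelling should be checked against the lfs setstripe man page for the release in use.

```python
# Hedged sketch: set a DoM/PFL composite layout matching the example above.
# Assumes Lustre 2.11+ client tools; the path and sizes are illustrative only.
import subprocess

def set_dom_pfl_layout(path):
    """First 1MB on the MDT, then 4 OST stripes to 1GB, then 60 stripes to EOF."""
    cmd = [
        "lfs", "setstripe",
        "-E", "1M", "-L", "mdt",      # component 1: [0, 1MB) stored on the MDT
        "-E", "1G", "-c", "4",        # component 2: [1MB, 1GB) on 4 OSTs
        "-E", "-1", "-c", "60",       # component 3: [1GB, EOF) on 60 OSTs
        path,
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    set_dom_pfl_layout("/mnt/lustre/domdir")  # sets the directory default layout
```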

DNE Improvements (LU-4684, Intel, 2.11/2.12)
Directory migration from single to striped/sharded directories (see the sketch below):
- Rebalances space usage and improves large-directory performance
- Inodes are also migrated along with directory entries
Automatic directory restriping to reduce/avoid the need for explicit striping at create:
- Start with a single-stripe directory for low overhead in common use cases
- Add extra shards when the master directory grows large enough (e.g. 32k entries)
- Existing directory entries stay in the master, or are migrated to shards asynchronously?
- New entries and inodes are created in the new directory shards on MDTs to distribute load
- Performance scales as the directory grows
MDT pools for space/class management
[Diagram: master directory, then +4 directory shards, then +12 directory shards]
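A hedged sketch of the explicit DNE operations that auto-restriping aims to reduce: creating a striped directory up front, and migrating an existing directory to another MDT. Paths, stripe counts, and MDT indices are illustrative assumptions, and option details vary by release, so check 'lfs help setdirstripe' and 'lfs help migrate'.

```python
# Hedged sketch of explicit DNE directory striping and migration.
import subprocess

def make_striped_dir(path, stripe_count=4, start_mdt=0):
    """Create a new directory sharded across stripe_count MDTs."""
    subprocess.run(["lfs", "setdirstripe",
                    "-c", str(stripe_count),
                    "-i", str(start_mdt), path], check=True)

def migrate_dir(path, target_mdt):
    """Migrate an existing directory (entries and inodes) to another MDT."""
    subprocess.run(["lfs", "migrate", "-m", str(target_mdt), path], check=True)

if __name__ == "__main__":
    make_striped_dir("/mnt/lustre/shared/newdir", stripe_count=4)
    migrate_dir("/mnt/lustre/shared/olddir", target_mdt=1)
```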

ZFS Enhancements Related to Lustre (2.11+)
Lustre 2.10.1/2.11 osd-zfs updated to use ZFS 0.7.1:
- File create performance (parallel lock/alloc, new APIs) (Intel)
- LFSCK ZFS OI Scrub, ldiskfs->zfs backup/restore (LU-7585, Intel)
Features in ZFS 0.7.x:
- Dynamic dnode size for better xattr performance/space (LLNL)
- Optimized parallel dnode allocation (Delphix, LLNL, Intel)
- Improved kernel I/O buffer allocation (ABD) (others, Intel)
- Multi-mount protection (MMP) for improved HA safety (LLNL)
- Optimized CPU and QAT hardware checksums and parity (others, Intel)
- Better JBOD/drive handling (LEDs, automatic drive resilver) (LLNL)
Features for ZFS 0.8.x:
- On-disk encryption (Datto)
- Project quota accounting (Intel)
- Declustered RAID (dRAID) (Intel)
- Metadata Allocation Class (Intel)
- Likely lots more...

Miscellaneous Improvements (2.11)
Client asynchronous ladvise Lock Ahead (LU-6179, Cray*):
- Client (MPI-IO) requests read/write DLM locks before I/O to avoid latency/contention
- MPI-I/O integration so applications can benefit from this without modification
Token Bucket Filter (NRS-TBF) improvements (LU-9658, LU-9228, DDN*):
- Support for UID/GID in TBF policies; compound policy specification
- Real-time priorities for TBF rules if the server is overloaded
File access and modification auditing with ChangeLogs (LU-9727, DDN):
- Record UID/GID/NID in the ChangeLog for open() (success or fail) and other operations (see the sketch below)
Nodemap and virtualization improvements (LU-8955, DDN):
- Configure audit, SELinux, and default file layout on a per-nodemap basis
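As a hedged sketch of how ChangeLog-based auditing could be consumed in userspace: the example below assumes a changelog user has already been registered (e.g. "lctl --device lustre-MDT0000 changelog_register" returning an id such as cl1); record fields and flags vary by Lustre release, so the parsing is illustrative only.

```python
# Hedged sketch: read ChangeLog records for auditing and acknowledge them.
import subprocess

MDT = "lustre-MDT0000"   # illustrative MDT device name
USER = "cl1"             # assumed registered changelog user id

def read_changelog(start_rec=0):
    """Yield raw ChangeLog lines starting at start_rec."""
    out = subprocess.run(["lfs", "changelog", MDT, str(start_rec)],
                         check=True, capture_output=True, text=True)
    yield from out.stdout.splitlines()

def clear_processed(end_rec):
    """Acknowledge records up to end_rec so the MDT can purge them."""
    subprocess.run(["lfs", "changelog_clear", MDT, USER, str(end_rec)],
                   check=True)

if __name__ == "__main__":
    last = 0
    for line in read_changelog():
        last = int(line.split()[0])   # first field is the record index
        print("audit:", line)         # e.g. forward to a log/SIEM store
    if last:
        clear_processed(last)
```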

Upstream Kernel Client (LU-9679, ORNL)
- Kernel 4.14 client updated to approximately Lustre 2.8, with some fixes from Lustre 2.9
- Lustre 2.10 updated to work with kernel ~4.12 (LU-9558)
- Improved kernel-internal time handling (LU-9019): 64-bit clean to avoid Y2038 issues; remove jiffies and cfs_time_*() wrapper functions
- Continued user header changes (LU-6401): allow building user tools against the upstream kernel
- Kernel tracepoints for logging/debugging/performance analysis (LU-8980): replace CDEBUG() macros and Lustre kernel debug logs; has potential to improve (or not?) debugging of Lustre problems, needs careful review

File Level Redundancy (LU-9771, Intel, 2.11/2.12)
Based on the Progressive File Layout (PFL) feature in Lustre 2.10 (Intel, ORNL)
Significant value and functionality added for HPC and other environments:
- Optionally set on a per-file/directory basis (e.g. mirror input files and one daily checkpoint; see the mirroring sketch below)
- Higher availability for server/network failure: finally better than HA failover
- Robustness against data loss/corruption: mirroring (and later M+N erasure coding)
- Increased read speed for widely shared input files: N-way mirror over many OSTs
- Mirror/migrate files over multiple storage classes: NVRAM->SSD->HDD (e.g. Burst Buffer)
- Local vs. remote replicas (WAN)
- Partial HSM file restore
- File versioning (no-resync replica)
- Many more possibilities...
[Diagram: Replica 0, Object j (primary, preferred); Replica 1, Object k (stale); delayed resync]
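A hedged sketch of creating and resyncing an FLR mirrored file from the command line ('lfs mirror ...' was introduced alongside this work). The paths, mirror counts, and pool name are illustrative assumptions; exact options depend on the release in use.

```python
# Hedged sketch: N-way mirroring and delayed resync with 'lfs mirror'.
import subprocess

def mirror_file(path, copies=2, pool=None):
    """Create an N-way mirrored layout for a file."""
    cmd = ["lfs", "mirror", "create", "-N" + str(copies)]
    if pool:
        cmd += ["-p", pool]          # e.g. place one copy in a flash OST pool
    cmd.append(path)
    subprocess.run(cmd, check=True)

def resync_file(path):
    """Bring stale replicas back in sync after delayed-resync writes."""
    subprocess.run(["lfs", "mirror", "resync", path], check=True)

if __name__ == "__main__":
    mirror_file("/mnt/lustre/input/params.dat", copies=2)
    resync_file("/mnt/lustre/input/params.dat")
```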

Client-Side Data Compression (LU-10026, 2.12)
[Graph: performance estimate with compression performed within IOR, courtesy University of Hamburg]
Piecewise compression (see the sketch below):
- Compressed in 32KB chunks
- Allows sub-block read/write
Integrated with ZFS data blocks:
- Leverage per-block type/size
- Code/disk format changes needed
- Avoid de-/re-compressing data
Good performance/space benefits
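The sketch below is a purely conceptual illustration (not the LU-10026 implementation) of the piecewise idea: compressing data in fixed 32KB chunks so that any sub-range can later be read without decompressing the whole file.

```python
# Conceptual illustration only: 32KB piecewise compression with zlib.
import zlib

CHUNK = 32 * 1024  # 32KB piecewise-compression unit

def compress_chunks(data: bytes):
    """Return a list of independently compressed 32KB chunks."""
    return [zlib.compress(data[off:off + CHUNK])
            for off in range(0, len(data), CHUNK)]

def read_subrange(chunks, offset, length):
    """Serve a sub-range by decompressing only the chunks it touches."""
    first, last = offset // CHUNK, (offset + length - 1) // CHUNK
    buf = b"".join(zlib.decompress(chunks[i]) for i in range(first, last + 1))
    start = offset - first * CHUNK
    return buf[start:start + length]

if __name__ == "__main__":
    payload = bytes(range(256)) * 1024           # 256KB of sample data
    chunks = compress_chunks(payload)
    assert read_subrange(chunks, 40_000, 100) == payload[40_000:40_100]
```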

Tiered Storage with Composite Layouts (2.12/2.13)
- Policy engine to manage migration over tiers and rebuild replicas, fed by ChangeLogs
- Policies for pathname, user, extension, age, OST pool, mirror copies, ...
- FLR provides mechanisms for safe migration of (potentially in-use) data
- Integration with job scheduler and workflow for prestage/drain/archive
- Multiple policy and scanning engines presented at LUG'17
- Multiple presentations on tiered storage at LAD'17
- Integrated burst buffers are a natural starting point
- This is largely a userspace integration task, with some hooks into Lustre (see the sketch below)
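A minimal userspace policy-engine loop might look like the sketch below. The mount point, OST pool name, and policy thresholds are illustrative assumptions, and a production engine (such as those presented at LUG'17) would consume ChangeLogs rather than doing a full scan.

```python
# Hedged sketch of a userspace tiering policy engine: scan a tree and move
# cold or archival files to a slower OST pool; FLR/'lfs migrate' lets the
# data move safely even if files may be in use.
import os
import subprocess
import time

MOUNT = "/mnt/lustre/project"   # illustrative mount point
COLD_POOL = "archive"           # assumed OST pool of SMR/HDD OSTs
COLD_AGE = 30 * 24 * 3600       # demote files untouched for 30 days

def demote(path):
    """Migrate file data to the cold pool."""
    subprocess.run(["lfs", "migrate", "-p", COLD_POOL, path], check=True)

def run_policy():
    now = time.time()
    for dirpath, _dirs, files in os.walk(MOUNT):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            if now - st.st_atime > COLD_AGE or name.endswith(".tar"):
                demote(path)

if __name__ == "__main__":
    run_policy()
```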

Tiered Storage and File Level Redundancy
Data locality, with direct access from clients to all storage tiers as needed.
[Architecture diagram: Management Target (MGT); Metadata Targets (MDTs) with Metadata Servers (~10s); SAS Object Storage Targets (OSTs) with Object Storage Servers (~1000s); SATA/SMR archive OSTs (erasure coded); Policy Engine; NVMe MDTs on the client network; NVMe OSTs/LNet routers on the client network acting as a "Burst Buffer"; Lustre clients (~100,000+)]

Improved Client Efficiency (2.11/2.12+)
Small file write optimizations (LU-1575, LU-9409, Cray, Intel):
- Reduce client and RPC/server overhead for small (<= 4KB) reads/writes
Disconnect idle clients from servers (LU-7236, Intel):
- Reduce memory usage on client and server for large systems
- Reduce network pings and recovery times
Aggregate statfs() RPCs on the MDS (LU-10018)
Reduce wakeups and background tasks on idle clients (LU-9660, Intel):
- Synchronize wakeups between threads/clients (per jobid?) to minimize jitter
- Still need to avoid a DoS of the server if all clients ping/reconnect at the same time

DNE Metadata Redundancy (2.13+)
New directory layout hash for mirrored directories, mirrored MDT inodes:
- Each dirent copy holds multiple MDT FIDs for the inode (see the sketch below)
- A dirent copy is stored on each mirror directory shard
- A name lookup on any MDT can access the inode via any FID
- Copies of mirrored inodes are stored on different MDTs
DNE2 distributed transactions for update recovery:
- Ensure that copies stay in sync on multiple MDTs
Redundancy policy is flexible, per-filesystem or per-subtree
Flexible MDT space/load balancing with striped directories
Early design work started, discussions ongoing
[Diagram: mirrored directory "home" (copies Dir 2:A on MDT0002 and Dir 5:B on MDT0005); each copy of the dirent "foo" holds both inode FIDs (2:1 and 5:2), and mirrored inode copies with their LOV layout are stored on MDT0002 and MDT0005]
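Purely as a conceptual sketch (not Lustre code, and the names are made up): the idea that a dirent copy carries the FIDs of all mirrored inode copies, so a lookup on whichever MDT holds a directory shard can resolve the name even if another MDT is unavailable.

```python
# Conceptual sketch of a mirrored dirent holding multiple MDT FIDs.
from dataclasses import dataclass

@dataclass(frozen=True)
class FID:
    mdt: int      # MDT index holding this copy (illustrative)
    oid: int      # object id within that MDT (illustrative)

@dataclass
class MirroredDirent:
    name: str
    inode_fids: tuple     # one FID per mirrored inode copy, e.g. (2:1, 5:2)

def lookup(dirent: MirroredDirent, available_mdts: set):
    """Return a FID whose MDT is currently reachable, if any."""
    for fid in dirent.inode_fids:
        if fid.mdt in available_mdts:
            return fid
    return None

if __name__ == "__main__":
    foo = MirroredDirent("foo", (FID(2, 1), FID(5, 2)))
    print(lookup(foo, available_mdts={5}))   # MDT0002 down -> use FID 5:2
```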

Notices and Disclaimers
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.
No animals were harmed during the production of these slides. BPA and gluten free.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks
© Intel Corporation. Intel, Intel Inside, the Intel logo, Xeon, Intel Xeon Phi, Intel Xeon Phi logos and Xeon logos are trademarks of Intel Corporation or its subsidiaries in the United States and/or other countries.
* Other names and brands may be claimed as the property of others.