The current status of the adoption of ZFS* as backend file system for Lustre*: an early evaluation


Gabriele Paciucci, EMEA Solution Architect

Outline
The goal of this presentation is to give an update on the current status of the ZFS and Lustre implementation from a sysadmin perspective. Johann will cover the development status.
- Benefits of the ZFS and Lustre implementation
- Performance: sequential I/O and metadata
- Reliability
- Availability

ZFS benefits
ZFS is a robust, scalable file system with features not available in other file systems today. ZFS can enable a cheaper storage solution for Lustre and increase the reliability of data on the next generation of fat HDDs.
- Hybrid Storage Pool (ARC + L2ARC + ZIL)
- Copy on Write (COW)
- Checksums
- Always consistent on disk
- Snapshots and replication
- Resilvering
- Manageability
- Compression and deduplication

How do Lustre and ZFS interact?
Lustre depends on the ZFS on Linux implementation of ZFS. Lustre targets run on a local file system on the Lustre servers. The supported Object Storage Device (OSD) layers are:
- ldiskfs (EXT4), the commonly used driver
- ZFS, the second user of the OSD layer, based on the OpenZFS implementation
Lustre targets can be of different types (a hybrid ldiskfs/ZFS configuration is possible), and Lustre clients are unaffected by the choice of OSD file system. Lustre on ZFS has been functional since Lustre 2.4; a formatting sketch follows below.
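As a concrete illustration, the sketch below formats one ZFS-backed OST with mkfs.lustre, which creates the backing zpool and the Lustre target in one step. This is a minimal sketch, not taken from the slides: the pool name, device paths, and MGS NID are hypothetical placeholders.

```python
# Minimal sketch: formatting a ZFS-backed OST. Pool name, device paths and
# the MGS NID are placeholders, not the presenter's configuration.
import subprocess

def make_zfs_ost(index, devices, fsname="lustre", mgsnode="mgs@o2ib"):
    """Create a RAID-Z2 pool and format it as a Lustre OST in one step."""
    cmd = [
        "mkfs.lustre", "--ost",
        "--backfstype=zfs",            # use the osd-zfs driver instead of ldiskfs
        f"--fsname={fsname}",
        f"--index={index}",
        f"--mgsnode={mgsnode}",
        f"ostpool{index}/ost{index}",  # pool/dataset that will back this target
        "raidz2", *devices,            # vdev layout, e.g. an 8-disk RAID-Z2
    ]
    subprocess.run(cmd, check=True)

# Example (device names are placeholders):
# make_zfs_ost(0, [f"/dev/disk/by-id/scsi-disk{i}" for i in range(8)])
```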

Our lab and partners' PoCs
There is strong interest among our partners and end users in testing the Lustre + ZFS combination. This work tries to summarize the results of benchmarks run in Intel's HPC lab in Swindon (UK), managed by Jamie Wilcox, together with early results from partners' proofs of concept. Intel is actively improving the ZFS + Lustre integration (see Johann's deck). Intel Enterprise Edition for Lustre will support ZFS in Q4 2014.

Test bed configuration
- Metadata servers: 2x Intel Xeon E5-2643 v2 (3 GHz), 128 GB RAM; Management Target (MGT) and Metadata Target (MDT) on 4x Intel S3700 400 GB SSDs (RAID10 for ldiskfs, striped mirror for ZFS)
- Object storage servers: 2x Intel Xeon E5-2643 v2 (3 GHz), 64 GB RAM; Object Storage Targets (OSTs) on 16 groups of 4x NL-SAS 4 TB HDDs, configured either as RAID5 sets of 4 HDDs (Intel RAID cards, for ldiskfs) or as RAID-Z2 sets of 8 HDDs (for ZFS)
- Intel Manager for Lustre on the management network
- High-performance data network: Mellanox FDR InfiniBand
- 16 Lustre clients running IEEL 2.0 (Lustre 2.5 based)
- Lustre server version 2.6.50, ZFS version 0.6.3

Sequential I/O: ldiskfs vs RAID-Z2
[Chart: IOR write and read bandwidth (MB/s) for ldiskfs and RAID-Z2]
- IOR results from 384 threads and an aggregate file size of 3 TB
- Theoretical performance is 6.7 GB/s (48x 4 TB NL-SAS at 140 MB/s)
- 16 OSTs for ldiskfs, using Intel RAID cards and RAID5 with 4 HDDs
- 8 OSTs for RAID-Z2, using 8 HDDs
- ldiskfs pays a penalty for fragmentation and misalignment in this uncommon configuration, as confirmed by brw_stats
- ZFS write performance is in line with expectations and takes advantage of COW
- ZFS read performance is limited by the small record size and limited concurrency
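The theoretical figure quoted above is simple arithmetic; the following lines are a sketch of the calculation rather than anything from the slides.

```python
# Back-of-the-envelope check of the 6.7 GB/s figure quoted above:
# 48 NL-SAS drives, each streaming at roughly 140 MB/s.
drives = 48
per_drive_mb_s = 140
aggregate_gb_s = drives * per_drive_mb_s / 1000
print(f"Theoretical aggregate bandwidth: {aggregate_gb_s:.2f} GB/s")  # ~6.72 GB/s
```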

Sequential I/O: where to improve?
- CPU frequency on the OSS matters; accurate design is necessary (3 GHz or more)
- Increase the record size to 16 MB: a 128 KB record size does not fit Lustre's workload
- Potentially, the FIEMAP work on the Network Request Scheduler (NRS) can schedule a large number of requests more efficiently than the ZFS code can do itself
- Increase the number of concurrent reads in the OSD-ZFS layer
- Optimize the ZFS I/O scheduler for Lustre; its per-vdev defaults are shown below (a tuning sketch follows the table)

Parameter                      Default MIN   Default MAX
zfs_vdev_sync_read             10            10
zfs_vdev_sync_write            10            10
zfs_vdev_async_read            1             3
zfs_vdev_async_write           1             10
zfs_vdev_scrub (resilver)      1             2
zfs_vdev_scheduler             noop
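These limits are exposed as ZFS on Linux module parameters under /sys/module/zfs/parameters. The sketch below is an illustration rather than the presenter's tuning: it raises the async read concurrency at runtime, and the chosen values are examples only.

```python
# Sketch: raising the per-vdev async read concurrency of the ZFS I/O scheduler.
# The parameter names are the standard zfs_vdev_*_active module parameters;
# the values are illustrative, not recommendations from the slides.
from pathlib import Path

ZFS_PARAMS = Path("/sys/module/zfs/parameters")

def set_zfs_param(name, value):
    """Write a ZFS module parameter and print what is now in effect."""
    param = ZFS_PARAMS / name
    param.write_text(str(value))
    print(f"{name} = {param.read_text().strip()}")

# Allow more concurrent async reads per vdev than the 1/3 defaults so that
# large streaming reads keep the RAID-Z2 vdevs busy.
set_zfs_param("zfs_vdev_async_read_min_active", 4)
set_zfs_param("zfs_vdev_async_read_max_active", 16)
```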

Metadata: ldiskfs vs ZFS
[Chart: ops/sec for CREATE, STAT, UNLINK, and MKNOD with ldiskfs and ZFS]
- Metadata performance is comparable only on stats
- mdsrate results from 32 threads, 3.2M files in 32 directories
- ldiskfs: Intel RAID card with FastPath technology, RAID10 array of 4x Intel S3700 400 GB SSDs
- ZFS (0.6.2): striped mirrored vdev of 4x Intel S3700 400 GB SSDs

Metadata: ZFS memory management
[Charts: ARC size vs. ARC max size and arc_meta size vs. arc_meta_limit over time, annotated with CPU wait and "arc_meta size not reset"]
It is critical to tune the ZFS arc_meta_limit parameter on the MDS in order to achieve decent results.
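On ZFS on Linux this limit is the zfs_arc_meta_limit module parameter. Below is a minimal sketch, not the presenter's setup, for inspecting and raising it; the 96 GiB value is just an example for a 128 GB MDS.

```python
# Sketch: inspecting and raising zfs_arc_meta_limit on a ZFS MDS.
# The sizing below is an example only, not a recommendation from the slides.
from pathlib import Path

ARC_META_LIMIT = Path("/sys/module/zfs/parameters/zfs_arc_meta_limit")

def get_arc_meta_limit():
    return int(ARC_META_LIMIT.read_text())

def set_arc_meta_limit(nbytes):
    ARC_META_LIMIT.write_text(str(nbytes))

print("current zfs_arc_meta_limit:", get_arc_meta_limit())
# e.g. set_arc_meta_limit(96 * 2**30)   # 96 GiB on a 128 GB MDS (example value)
```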

Metadata: where to improve?
A lot of work is needed to improve metadata performance:
- Increase the ZAP indirect and leaf block sizes
- Better ARC metadata memory management for Lustre's workload
- Improve ZAP caching
Compared to ldiskfs, ZFS has some benefits:
- It is not limited to 4 billion Lustre inodes
- Better scalability in the code: ldiskfs metadata performance is limited by the size of the directory, because leaf block updates become more random as the directory gets larger

Reliability: is RAID dead?
- MTTDL is safe using RAID with double parity
- Hard Error Rate (HER): when a disk cannot read a sector, the disk is failed and a RAID rebuild is triggered
- The probability of a second hard error during the rebuild is high with larger NL-SAS drives
- Because there are cables and electronics in the path, storage channels are susceptible to Silent Data Corruption (SDC)

Hard error rates:
Device                              Hard Error Rate (bits)   Equivalent (PB)
SATA Consumer                       10E14                    0.01
SATA/SAS Nearline Enterprise        10E15                    0.11
Enterprise SAS/FC                   10E16                    1.11
LTO and some Enterprise SAS SSDs    10E17                    11.10
Enterprise Tape or greater          10E19                    1110.22

SDC rate vs. sustained transfer rate for a year:
SDC Rate                    10 GiB/s    100 GiB/s    1 TiB/s      10 TiB/s
SAS/FC (10E21)              0.0         0.0          0.3          2.7
SATA/IB Standard (10E20)    0.0         0.3          2.7          27.1
10E19                       0.3         2.7          27.1         270.9
10E18                       2.7         27.1         270.9        2,708.9
10E17                       27.1        270.9        2,708.9      27,089.2
10E16                       270.9       2,708.9      27,089.2     270,892.2
10E15                       2,708.9     27,089.2     270,892.2    2,708,921.8
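Reading the second table as expected silent-corruption events per year (my interpretation, not stated explicitly on the slide), the entries follow from a simple calculation; the sketch below reproduces one of them.

```python
# Sketch: reproducing one entry of the SDC table above, reading it as the
# expected number of undetected errors per year at a given sustained rate.
SECONDS_PER_YEAR = 365 * 24 * 3600

def sdc_events_per_year(rate_gib_s, error_rate_bits):
    bits_per_year = rate_gib_s * 2**30 * 8 * SECONDS_PER_YEAR
    return bits_per_year / error_rate_bits

# 10 GiB/s sustained for a year against a 1-in-10^15 bit error rate:
print(round(sdc_events_per_year(10, 1e15), 1))   # ~2708.9, matching the table
```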

Reliability with ZFS
- ZFS only copies live (relevant) blocks of data when creating mirrors or RAID groups. This means it takes nearly zero time to create and initialize new RAID sets.
- Always consistent on disk (software bug or massive corruption?): fsck is a challenge for a RAID6 set of 10x 6 TB NL-SAS drives, and MDRAID needs to rescan the RAID array at each reboot.
- Scrubbing: ZFS can scrub data in the background, reading all of the blocks and doing checksum comparison and correction.
- Resilvering: ZFS rebuilds (resilvers) top-down, from the most important blocks in its tree to the least, reconstructing only the blocks that matter and writing only those to the new drive, and it verifies the validity of every block it reads using the checksums stored in the parent blocks along the way. This is extremely significant for both efficiency and data integrity, especially as drives continue to grow.
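Operationally, a background scrub is started and monitored with the standard zpool commands; the lines below are a trivial sketch with a placeholder pool name.

```python
# Sketch: starting a background scrub on one OST pool and checking on it.
# "ostpool0" is a placeholder pool name.
import subprocess

subprocess.run(["zpool", "scrub", "ostpool0"], check=True)
status = subprocess.run(["zpool", "status", "ostpool0"],
                        capture_output=True, text=True, check=True)
print(status.stdout)   # shows scrub progress and any repaired blocks
```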

Performance regression during repair
[Chart: IOR write and read bandwidth (MB/s) for baseline, resilvering, and scrubbing; read bandwidth drops by roughly 15% while resilvering and 22% while scrubbing, writes are unaffected]
- Only reads are affected during repair
- Resilvering and scrubbing are auto-tuned by the ZFS I/O scheduler
- IOR results from 384 threads and an aggregate file size of 1.5 TB
- RAID-Z2 using 7 HDDs; 8 OSTs are available
- Resilvering and scrubbing ran on one OST during the whole IOR run

Reliability: where to improve?
- Declustered ZFS: randomize the distribution of RAID-Z redundancy groups and spares to retain full bandwidth during resilvering
- ZFS Event Daemon (ZED): ZED can trap events and take actions; it is included in the latest version (0.6.3) of ZFS and can help add hot-spare disks automatically

Availability
- Lustre depends on the ZFS on Linux implementation of ZFS
- Integration with Pacemaker/Corosync is not a problem
- Pools should be imported in a non-persistent way
- One script to import/export the pools and one script to mount/unmount Lustre (a sketch follows below)
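The following is a minimal sketch of what those two scripts might do, not the presenter's actual resource agents; pool, dataset, and mountpoint names are placeholders. The import uses cachefile=none so the pool is not re-imported automatically at boot, which is what a failover manager wants.

```python
# Sketch: start/stop helpers a Pacemaker resource agent could call.
# Pool, dataset, and mountpoint names are placeholders.
import subprocess

def start_target(pool, dataset, mountpoint):
    # Non-persistent import: -N skips mounting ZFS datasets and
    # cachefile=none keeps the pool out of the boot-time auto-import.
    subprocess.run(["zpool", "import", "-N", "-o", "cachefile=none", pool],
                   check=True)
    subprocess.run(["mount", "-t", "lustre", f"{pool}/{dataset}", mountpoint],
                   check=True)

def stop_target(pool, mountpoint):
    subprocess.run(["umount", mountpoint], check=True)
    subprocess.run(["zpool", "export", pool], check=True)

# start_target("ostpool0", "ost0", "/mnt/ost0")
```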

Conclusion
- ZFS can enable safe and efficient software RAID solutions using JBODs.
- ZFS can guarantee robust data protection without any special protocol (such as T10-PI).
- Sequential write performance is in line with expectations, and Intel and the ZFS community are working to improve performance further.
- Evaluating large (1 TB+) SSD devices on the OSTs for L2ARC would be a nice next step for this work, as would evaluating enterprise functionality such as deduplication and compression.