Operational characteristics of a ZFS-backed Lustre filesystem

Operational characteristics of a ZFS-backed Lustre filesystem
Daniel Kobras
science + computing ag
IT services and software for demanding computer networks
Tübingen, München, Berlin, Düsseldorf

science+computing ag
- Established: 1989
- Offices: Tübingen, Munich, Berlin, Düsseldorf, Ingolstadt, Beijing
- Employees: 275
- Shareholder: Atos SE (100%)
- Turnover 2013: 30.70 million euros
- Portfolio:
  - IT services for complex computing environments
  - comprehensive solutions for Linux- and Windows-based HPC
  - scvenus system management software for efficient administration of homogeneous and heterogeneous networks

Use Case

Environment
- Lustre 1.8 filesystem on JBOD hardware reaches end of life
- ldiskfs on top of mdraid
- wacky hardware (frequent drive failures)
- abandoned storage management software (no-go with compliant Java versions)
- performance issues unless using the Sun/Oracle Lustre kernel (stuck with 1.8.5)

Environment
- Re-use storage and server hardware for a scratch filesystem:
  - upgrade to current OS and Lustre versions
  - avoid the mdraid layer
  - avoid the obsolete JBOD management
  - increase drive redundancy
  -> Lustre 2.5.3 with ZFS backend using raidz3 (8+3) vdevs
- Learn from the previous installation and apply overall improvements, not specifically tied to ZFS

Administration & Maintenance

Deployment
- Looser dependency on the kernel version (facilitates security updates)
- Using spl and zfs, with auto-build of the modules via dkms
- Backported support for zfs/spl 0.6.4 (LU-6038) to Lustre 2.5.3
- Step-by-step creation (see the sketch below):
  - create the zpool
  - test with a temporary ZFS POSIX layer
  - zap the test filesystem
  - set canmount=off
  - create the Lustre backend filesystems
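
A minimal sketch of that sequence, assuming a single 8+3 raidz3 vdev built from /dev/disk/by-vdev aliases; the pool name, device aliases, fsname, index and MGS NID are placeholders, not the values of the actual installation:

# create the pool with one raidz3 (8+3) vdev from eleven JBOD drives
zpool create ost0pool raidz3 /dev/disk/by-vdev/d{0..10}

# temporary ZFS POSIX layer dataset for burn-in tests, zapped again afterwards
zfs create ost0pool/test
# ... run local I/O tests against /ost0pool/test ...
zfs destroy ost0pool/test

# keep the pool's root dataset from being mounted as a regular filesystem
zfs set canmount=off ost0pool

# create the Lustre OST backend inside the pool
mkfs.lustre --backfstype=zfs --fsname=scratch --ost --index=0 \
    --mgsnode=192.168.1.10@tcp ost0pool/ost0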

Stability
- Server OOM with zfs/spl 0.6.3 during an IOZone rewrite test
- Clients crashed due to empty ACLs (otherwise filtered out by ldiskfs, LU-5150)
- ZFS uncovered several faulty drives (previously undetected/unidentified by mdraid)

JBOD storage
- There's life beyond crappy Java management tools
- There are standard tools and even a SCSI standard (SES) for talking to storage enclosures
- sg_ses and /sys/class/enclosure FTW! (see the sketch below)
  - easy, automated matching of /dev/sdX to the corresponding drive slot
  - steering locator LEDs from the OS, no need for external tools
- /dev/disk/by-vdev makes it even easier: ZFS reports the failed drive as enclosure X, slot Y -> location immediately visible from the zpool status output (optimize disk names for the most common daily tasks, i.e. opening calls for failed drives and replacing failed drives)
- Less impact from resilvering compared to an md raid rebuild
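
A rough sketch of how this looks from the shell; the enclosure ID, slot name and sg device below are illustrative (they vary between enclosure models), and the vdev_id.conf mapping is made up:

# list which /dev/sdX sits in which enclosure slot, straight from sysfs
for d in /sys/class/enclosure/*/*/device/block/*; do
    echo "$(echo "$d" | cut -d/ -f5,6) -> /dev/$(basename "$d")"
done

# light the locator LED of a slot, e.g. before pulling a failed drive
echo 1 > "/sys/class/enclosure/0:0:12:0/Slot07/locate"

# the same information is available via the SES pages of the enclosure's sg device
sg_ses --page=aes /dev/sg12

# /dev/disk/by-vdev names come from /etc/zfs/vdev_id.conf, e.g.
#   alias d0  /dev/disk/by-path/pci-0000:03:00.0-sas-phy0-lun-0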

High Availability
- Pacemaker integration
- Split JBOD drives into 'left' and 'right' pools, controlled by either frontend server
- Separate resources for each zpool and each Lustre target
- Customized scripts for more fine-grained control than the supplied init script (manual equivalent sketched below)
  - ZFS: zpool import with cachefile=none (no MMP equivalent, requires fencing)
  - Filesystem: modified to grok the ZFS pool syntax
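
On failover, the customized resource scripts essentially boil down to the following manual steps (pool, target and mount point names are examples); with cachefile=none and no MMP, the import must only happen after the other node has been fenced:

# take over the pool on the surviving server, without persisting a cachefile
zpool import -o cachefile=none ost0pool

# start the Lustre target by mounting its ZFS dataset with filesystem type lustre
mkdir -p /mnt/ost0
mount -t lustre ost0pool/ost0 /mnt/ost0

# an orderly shutdown (e.g. before failing back) is the reverse
umount /mnt/ost0
zpool export ost0pool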

Debugging

zdb
- Under-the-hood debugging of ZFS/zpool internals
- Almost, but not quite, entirely unlike debugfs
- User interface apparently designed by Huffman-coding a standard set of options onto a Morse alphabet
- Calling zdb -ddddd is perfectly normal, and not to be mistaken for zdb -dddd

zdb Example
Find the physical location of data in file 'x':

# ls -i x
12345678 x
# zdb -ddddd tank/fs1 12345678
Dataset tank/fs1 [ZPL], ID 45, cr_txg 19, 251M, 3292 objects, rootbp
    DVA[0]=<3:59b6000:800> DVA[1]=<0:5d67800:800> [L0 DMU objset] fletcher4 lz4
    LE contiguous unique double size=800l/200p birth=1626187l/1626187p fill=3292
    cksum=154bf6b90b:7136427a15d:147c0807d0853:2a4d1fac6823a
(continued on the next slide)

zdb Example
# zdb -ddddd tank/fs1 12345678 (cont.)
    Object  lvl  iblk  dblk  dsize  lsize   %full  type
  12345678    2   16K  128K   259K   256K  100.00  ZFS plain file
(...)
Indirect blocks:
        0 L1  3:59b0000:800   4000L/200P    F=2 B=1626187/1626187
        0  L0 3:5956800:2c000 20000L/20000P F=1 B=1626187/1626187
    20000  L0 3:5982800:2c000 20000L/20000P F=1 B=1626187/1626187

zdb Examples
- Data virtual address (DVA): <vdev>:<offset>:<size>
- Data at the L0 DVAs (if F=1)
- Map a DVA to the physical location on disk:
  - easy for simple vdev types (single drive, mirror): block = (offset + headersize) / blocksize, with headersize = 4*1024*1024 (4 MB); worked example below
  - for raidz{,2,3}: read the source
  - or just zdb -R tank <DVA> (if you still can)
  - or cheat with strace zdb -e -R tank <DVA> (on an exported test pool; analyze the disk accesses)
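
A worked example of the simple-vdev arithmetic, reusing the L0 DVA 3:5956800:2c000 from the previous slide and pretending vdev 3 were a plain disk or mirror; the 4 MB header is the fixed ZFS label/boot area at the front of every vdev, while the device path is a placeholder:

# DVA 3:5956800:2c000 -> vdev 3, byte offset 0x5956800, allocated size 0x2c000
offset=$((0x5956800))
size=$((0x2c000))

# physical byte offset on the vdev = DVA offset + 4 MB label/boot header
phys=$((offset + 4 * 1024 * 1024))

# read the block directly from the disk backing vdev 3 (device path is an example)
dd if=/dev/disk/by-vdev/d3 of=/tmp/block.raw bs=512 \
    skip=$((phys / 512)) count=$((size / 512))

# or let zdb resolve the DVA itself, which also works for raidz vdevs
zdb -R tank 3:5956800:2c000
# on an exported test pool, add -e and wrap in strace to watch the disk accesses
strace zdb -e -R tank 3:5956800:2c000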

Performance Characteristics

Performance Baseline
- Study the performance of the Lustre client
- Relate it to the performance of local ZFS (ZPL)
- Study FS behaviour with more general I/O patterns, not just bulk I/O
- Study performance in relation to the capabilities of the storage hardware:
  - interconnect performance
  - HDD performance

HDD Performance
[Sketch built up over three slides: throughput vs. block size, with the characteristic I/O size marking the transition from the random-I/O regime to the bulk regime]

HDD Performance
- HDD datasheet:
  - 1 TB capacity
  - 7200 RPM
  - 8-9 ms seek latency
  - 90 MB/s sequential throughput
- Characteristic values (sanity check below):
  - max. rotational latency = 60 s / 7200 RPM = 8.3 ms (one full revolution)
  - random IOPS = 1 s / (seek latency + 1/2 max. rotational latency) = 79
  - characteristic I/O size = seq. throughput / random IOPS = 1.1 MB
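
A quick check of those numbers (taking 8.5 ms as the middle of the quoted 8-9 ms seek range):

# one full revolution at 7200 RPM, in ms
echo "scale=1; 60000 / 7200" | bc          # -> 8.3
# random IOPS: 1000 ms / (8.5 ms seek + 4.15 ms average rotational latency)
echo "scale=0; 1000 / (8.5 + 4.15)" | bc   # -> 79
# characteristic I/O size in MB: 90 MB/s sequential / 79 random IOPS
echo "scale=2; 90 / 79" | bc               # -> 1.13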

Performance Expectations
- Configuration:
  - 6x raidz3 (8+3) vdevs
  - 128k recordsize
  - no compression
- Bulk throughput:
  - writes need to transfer parity from the host to disk
  - max. write throughput = 8/11 x max. read throughput
- Per vdev:
  - expected characteristic I/O size for 8+3 raidz3: 8 x 1.1 MB = 8.8 MB
  - might saturate at a smaller I/O size due to the interconnect limit
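
The same kind of quick check for the two expectation figures:

# write-to-read throughput ratio for an 8+3 raidz3 vdev (8 data + 3 parity drives)
echo "scale=2; 8 / 11" | bc     # -> .72
# expected characteristic I/O size per vdev: 8 data drives x 1.1 MB
echo "scale=1; 8 * 1.1" | bc    # -> 8.8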

Benchmark Results (ZPL)
[Chart: IOZone results for the ZFS POSIX layer, 16 GB test file size; throughput (kB/s, 0 to 2,000,000) vs. record length (64 kB to 16384 kB) for write, rewrite, read, reread, random read, random write, bkwd read, stride read, fwrite, frewrite, fread, freread]

Benchmark Results (Lustre Client)
[Chart: IOZone results for the Lustre client, 128 GB test file size, note the different scale; throughput (kB/s, 0 to 800,000) vs. record length (64 kB to 16384 kB) for the same set of tests]

Results
- ZPL:
  - RMW overhead for random I/O < recordsize
  - significant metadata overhead for write operations (write vs. rewrite, rewrite vs. read)
  - for I/O units >= recordsize, random write performance similar to bulk
  - read performance also depends on the write pattern; IOPS-bound for I/O units < 4-8 MB
  - in some tests, weird things happen at 4 MB I/O size
- Lustre:
  - glosses over the ZPL small-I/O performance drawbacks
  - 2.5.3: consistent but slow single-stream performance (not shown here)
  - bulk I/O saturates the interconnect

Conclusion
- ZFS (and the OS update) provide a stable configuration on previously unstable hardware
- Easier administration and maintenance than the mdraid-based setup
- Well suited for JBOD storage
- The different backend FS uncovered a bug in Lustre, but is otherwise stable
- ZFS performance meets expectations
- Lustre performance is ok for most I/O loads with units >= 128 kB
- Single-stream performance is generally slow; considering an upgrade to later Lustre versions

Thank you for your attention.
Daniel Kobras
science + computing ag
www.science-computing.de
Phone 07071 9457-0
info@science-computing.de