OpenZFS Performance Improvements

OpenZFS Performance Improvements
LUG Developer Day 2015, April 16, 2015
Brian Behlendorf
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Agenda
- IO Performance: Large Block Feature
- Meta Data Performance: Large DNode Feature
- Planned Performance Investigations

IO Performance: Large Blocks
All objects in ZFS are divided into blocks.

Virtual Devices (VDEVs)
[Diagram: VDEV tree with a root vdev, top-level mirror and RAIDZ vdevs, and leaf devices holding data and parity]
Where blocks are stored is controlled by the VDEV topology.
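
For concreteness, the two pool layouts benchmarked later could be created along these lines (pool and device names are illustrative, not from the slides):

  zpool create tank mirror sda sdb                                    # 2-disk mirror
  zpool create tank raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj    # 10-disk RAIDZ2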

Why is 128K the Max Block Size?
Historical reasons:
- Memory and disk capacities were smaller 10 years ago
- 128K blocks were enough to saturate the hardware
- Good for a wide variety of workloads
- Good for a range of configurations
- Performed well enough for most users
But times have changed and it's now limiting performance.

Simulated Lustre OSS Workload
- fio (Flexible I/O Tester), 64 threads, 1M IO size, 4G files
- Random read/write workload
- All tests run after a pool import to ensure a cold cache
- All default ZFS tunings, except:
  zfs_prefetch_disable=1
  zfs set recordsize={4k..1m} pool/dataset
The workload is designed to benchmark a Lustre OSS (a setup sketch follows below).
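
A minimal sketch of one such run, assuming a pool named tank with a dataset tank/ost0 mounted at /tank/ost0 (names are illustrative):

  # Cold cache: re-import the pool before each run
  zpool export tank && zpool import tank
  # Disable prefetch, as in the slides
  echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
  # Sweep the block size; 1M shown here
  zfs set recordsize=1M tank/ost0
  # 64 threads of 1M random read/write against 4G files
  fio --name=oss-sim --directory=/tank/ost0 --rw=randrw --bs=1M \
      --size=4G --numjobs=64 --group_reporting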

Random 1M R/W, 2-Disk Mirror
[Chart: MB/s vs ZFS block size, 0-180 MB/s; series: Read Mirror, Write Mirror]
Read bandwidth increases with block size.

Random 1M R/W, 10-Disk RAIDZ
[Chart: MB/s vs ZFS block size, 0-1400 MB/s; series: read and write for RAIDZ, RAIDZ2, RAIDZ3]
Read and write bandwidth increase with block size.

Block Size Tradeoffs
- Bandwidth vs IOPs: larger blocks give better bandwidth, smaller blocks give better IOPs
- ARC memory usage: small and large blocks have the same fixed overhead; larger blocks may result in unwanted cached data and are harder to allocate memory for
- Overwrite cost
- Checksum cost
It's all about tradeoffs.

Block Size Tradeoffs (continued)
- Disk fragmentation and allocation cost
- Gang blocks
- Disk space usage
- Snapshots
- Compression ratio: larger blocks compress better
- De-duplication ratio: smaller blocks de-duplicate better
The right block size depends on your needs.

Let's Increase the Block Size
The 128K limit:
- Only a limit of the original ZFS implementation
- The on-disk format supports up to 16M blocks
New Large Block feature flag:
- Easy to implement for prototyping / benchmarking
- Hard to polish for real-world production use: cross-platform compatibility, zfs send/recv compatibility
- Requires many subtle changes throughout the code base
We expect bandwidth to improve (see the sketch below for how the feature is enabled).
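
In the releases where the feature later shipped, turning it on looks roughly like this (a sketch, with illustrative pool/dataset names; the zfs_max_recordsize tunable is only needed for blocks beyond 1M):

  # Enable the large_blocks feature flag on the pool
  zpool set feature@large_blocks=enabled tank
  # Use records up to 1M on the dataset
  zfs set recordsize=1M tank/ost0
  # Raise the module limit if experimenting with >1M blocks
  echo 16777216 > /sys/module/zfs/parameters/zfs_max_recordsize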

Random 1M R/W, 2-Disk Mirror (with large blocks)
[Chart: MB/s vs ZFS block size, 0-250 MB/s; series: Read Mirror, Write Mirror]
63% performance improvement for 1M random reads!

Random 1M R/W, 10-Disk RAIDZ (with large blocks)
[Chart: MB/s vs ZFS block size, 0-1400 MB/s; series: read and write for RAIDZ, RAIDZ2, RAIDZ3]
100% performance improvement for 1M random reads!

I/O Performance Summary
- 1M blocks can improve random read performance by a factor of 2x for RAIDZ
- Blocks up to 16M are possible, but require ARC improvements (issue #2129)
- Early results show 1M blocks may be the sweet spot; >1M blocks may help large RAIDZ configurations
- Planned for the next Linux OpenZFS release; large blocks are already merged upstream
Large blocks can improve performance.

Meta Data Performance - Large Dnodes
The MDT stores file striping in an extended attribute, and this was optimized for LDISKFS: 1K inodes with an inline xattr allow a single IO per xattr.
[Diagram: inode table of 1024B inodes with the xattr stored inline (required block), plus an optional external 4K xattr block]
LDISKFS was optimized long ago for Lustre.
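
In plain ldiskfs/ext4 terms, the 1K-inode layout corresponds to formatting with a larger inode size so the striping xattr fits in the inode itself; mkfs.lustre normally handles this, but as a standalone illustration (device name is hypothetical):

  mke2fs -t ext4 -I 1024 /dev/sdX          # 1024-byte inodes leave room for inline xattrs
  tune2fs -l /dev/sdX | grep 'Inode size'  # verify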

File Forks (xattr=on)
- Used on illumos / FreeBSD; the Linux xattr=on property maps each xattr to a file fork
- No compatibility issues
- No limit on xattr size, no limit on the number of xattrs
- Three IOs per xattr
[Diagram: 16K dnode block of 512B dnodes, with the xattr ZAP referenced via the dnode bonus space (required and optional blocks)]
The need for fast xattrs was known early in development.

Extended Attributes in SAs (xattr=sa)
- Goal is to reduce I/Os: the Linux xattr=sa property maps each xattr to a system attribute (SA), with 1 block of SAs (128K)
- Much faster: two IOs per xattr
[Diagram: 16K dnode block of 512B dnodes; xattr SAs live in the dnode bonus space, overflowing into an optional ZAP spill block]
The first step was to store xattrs as a system attribute (SA); selecting this is shown below.
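
SA-based xattrs are selected per dataset on ZFS on Linux (dataset name illustrative):

  zfs set xattr=sa tank/mdt0    # store xattrs as system attributes
  zfs get xattr tank/mdt0       # verify; the Linux default is xattr=on (directory based)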

Large DNode Feature
[Diagram: dnode block of variable 512B-16K dnodes; the enlarged bonus space holds the xattr SAs, with the ZAP spill block remaining optional]
Only a single I/O is needed to read multiple dnodes with xattrs.
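
The feature was still in development at the time of this talk; in the releases where it eventually landed it is exposed roughly as follows (a sketch based on later ZFS on Linux releases, dataset name illustrative):

  zpool set feature@large_dnode=enabled tank
  zfs set dnodesize=auto tank/mdt0    # or an explicit size: 1k, 2k, 4k, 8k, 16k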

Meta Data Workload
- xattrtest performance and correctness test: https://github.com/behlendorf/xattrtest
- 4096 files per process, one 512-byte xattr per file, 64 processes
- All tests run after a pool import to ensure a cold cache
- 16-disk (HDD) mirror pool, all default ZFS tunings
(A rough approximation of the per-process workload with standard tools is sketched below.)
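
For readers without xattrtest handy, the shape of one process's work can be approximated with the standard attr tools; this is only an illustrative sketch (paths and xattr name are made up), not the benchmark itself:

  # Create 4096 files and attach one 512-byte user xattr to each
  payload=$(head -c 512 /dev/urandom | base64 -w0 | head -c 512)
  for i in $(seq 1 4096); do
      f=/tank/mdt0/file.$i
      touch "$f"
      setfattr -n user.test -v "$payload" "$f"                  # set the xattr
      getfattr -n user.test --only-values "$f" > /dev/null      # read it back
  done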

xattrtest Performance
[Chart: Create/s, Setxattr/s, Getxattr/s, Unlink/s, 0-60000, for v0.6.4 xattr=on dnodesize=512, v0.6.4 xattr=sa dnodesize=512, and v0.6.4 xattr=sa dnodesize=1024]

ARC Memory Footprint
[Charts: ARC Size (%) and ARC Blocks (%), 0-120, for v0.6.3 xattr=sa dnodesize=512, v0.6.4 xattr=sa dnodesize=512, and v0.6.4 xattr=sa dnodesize=1024]
16x fewer blocks reduces I/O and saves ARC memory.

mds_survey.sh
[Chart: Create/s and Destroy/s, 0-10000, for v0.6.3 xattr=sa dnodesize=512, v0.6.4 xattr=sa dnodesize=512, and v0.6.4 xattr=sa dnodesize=1024]
30% performance improvement for creates and destroys.

Large DNode Performance Summary
- Improves performance across the board
- Total I/O required for cold files is reduced
- ARC: fewer cached blocks (no spill blocks), reduced memory usage; a smaller MRU/MFU results in faster memory reclaim and reduced lock contention in the ARC
- Reduces pool fragmentation
Increasing the dnode size has many benefits.

Large DNode Work in Progress
- Developed by Prakash Surya and Ned Bass
- Conceptually straightforward, but challenging
- Undergoing rigorous testing: ztest, zfs-stress, xattrtest, mds_survey, ziltest
- 80% done, under active development: ZFS Intent Log (ZIL) support, zfs send/recv compatibility
- Patches will be submitted for upstream review

Planned Performance Investigations
- Support the ZFS Intent Log (ZIL) in Lustre
- Improve Lustre's use of ZFS-level prefetching
- Optimize Lustre-specific objects (llogs, etc.)
- Explore TinyZAP for the Lustre namespace
- More granular ARC locking (issue #3115)
- Page-backed ARC buffers (issue #2129)
- Hardware-optimized checksums / parity
There are many areas where Lustre MDT performance can be optimized.

Questions?

SPL-0.6.4 / ZFS-0.6.4 Released
Released April 8th, 2015. Packages available for:
Six new feature flags:
- *Spacemap Histograms
- Extensible Datasets
- Bookmarks
- Enabled TXGs
- Hole Birth
- *Embedded Data
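
On an existing pool the new flags show up as disabled until enabled; one way to inspect and turn them on (pool name illustrative; zpool upgrade enables every feature the running module supports):

  zpool get all tank | grep feature@      # list feature flags and their state
  zpool upgrade tank                      # enable all supported feature flags
  zpool get feature@embedded_data tank    # check a single feature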

SPL-0.6.4 / ZFS-0.6.4 Released
New functionality:
- Asynchronous I/O (AIO) support
- Hole punching via fallocate(2)
- Two new properties (redundant_metadata, overlay)
- Various enhancements to the command tools
Performance improvements:
- zfs send/recv
- Spacemap histograms
- Faster memory reclaim
Over 200 bug fixes.
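
A quick illustration of the new knobs (dataset and file names are illustrative): redundant_metadata accepts all or most, overlay permits mounting over a non-empty directory, and hole punching can be exercised with util-linux fallocate:

  zfs set redundant_metadata=most tank/ost0    # keep fewer redundant metadata copies
  zfs set overlay=on tank/ost0                 # allow mounting on a non-empty mountpoint
  fallocate --punch-hole --keep-size --offset 0 --length 1M /tank/ost0/somefile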

New KMOD Repo for RHEL/CentOS
- DKMS and KMOD packages for EPEL 6
- Versions: spl-0.6.4, zfs-0.6.4, lustre-2.5.3
- Enable the repository: vim /etc/yum.repos.d/zfs.repo
  Disable the default zfs (DKMS) repository and enable the zfs-kmod repository (see the sketch below)
- Install ZFS, the Lustre server, or the Lustre client:
  yum install zfs
  yum install lustre-osd-zfs
  yum install lustre
See zfsonlinux.org/epel for complete instructions.
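
The repository switch amounts to flipping the enabled= flags in /etc/yum.repos.d/zfs.repo; the stanza names below follow the layout of the zfsonlinux.org EPEL repo file and are shown as an assumption for illustration, with only the relevant lines included:

  [zfs]
  # default DKMS repository: disable it
  enabled=0

  [zfs-kmod]
  # prebuilt kmod repository: enable it
  enabled=1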