GOODBYE, XFS: BUILDING A NEW, FASTER STORAGE BACKEND FOR CEPH
Sage Weil, Red Hat, 2017.09.12

OUTLINE
- Ceph background and context
- FileStore, and why POSIX failed us
- BlueStore, a new Ceph OSD backend
- Performance
- Recent challenges
- Current status, future
- Summary

MOTIVATION

CEPH
- Object, block, and file storage in a single cluster
- All components scale horizontally
- No single point of failure
- Hardware agnostic, commodity hardware
- Self-manage whenever possible
- Open source (LGPL)
- "A Scalable, High-Performance Distributed File System": performance, reliability, and scalability

Released two weeks ago:
- Erasure coding support for block and file (and object)
- BlueStore (our new storage backend)

CEPH COMPONENTS
- OBJECT: RGW - a web services gateway for object storage, compatible with S3 and Swift
- BLOCK: RBD - a reliable, fully-distributed block device with cloud platform integration
- FILE: CEPHFS - a distributed file system with POSIX semantics and scale-out metadata management
- LIBRADOS - a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
- RADOS - a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

OBJECT STORAGE DAEMONS (OSDS)
[Diagram: several OSDs, each backed by a local file system (xfs, btrfs, or ext4) on its own disk, alongside monitor (M) daemons]

OBJECT STORAGE DAEMONS (OSDS)
[Diagram: the same OSDs, now with the FileStore backend layered between each OSD and its local file system (xfs, btrfs, or ext4) on disk, alongside monitor (M) daemons]

IN THE BEGINNING
[Timeline, 2004-2016: EBOFS and FakeStore were the original backends]

OBJECTSTORE AND DATA MODEL
- ObjectStore: abstract interface for storing local data (EBOFS, FakeStore)
- Object: file-like data (byte stream), attributes (small key/value), omap (unbounded key/value)
- Collection: a "directory"; actually a shard of a RADOS pool, i.e. a Ceph placement group (PG)
- All writes are transactions: Atomic + Consistent + Durable; Isolation is provided by the OSD
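
The data model above maps onto a small set of types. The following is a minimal sketch, using illustrative names (Object, Collection, Transaction) rather than the actual Ceph ObjectStore interfaces:

  // Minimal sketch of the ObjectStore data model described above.
  // Names and types are illustrative, not the actual Ceph interfaces.
  #include <cstdint>
  #include <map>
  #include <string>
  #include <vector>

  struct Object {
    std::vector<uint8_t> data;                 // file-like byte stream
    std::map<std::string, std::string> attrs;  // small key/value attributes
    std::map<std::string, std::string> omap;   // unbounded key/value pairs
  };

  // A collection is a shard of a RADOS pool: one placement group (PG).
  struct Collection {
    std::map<std::string, Object> objects;     // keyed by object name
  };

  // All writes are expressed as transactions: an ordered list of ops that must
  // apply atomically, consistently, and durably (isolation is the OSD's job).
  struct Transaction {
    struct Op {
      enum class Type { Write, SetAttr, OmapSetKey } type;
      std::string collection, object, key;
      uint64_t offset = 0;
      std::vector<uint8_t> payload;
    };
    std::vector<Op> ops;
  };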

LEVERAGE KERNEL FILESYSTEMS!
[Timeline, 2004-2016: EBOFS (later removed) and FakeStore give way to FileStore + BtrFS, then FileStore + XFS]

FILESTORE
- FileStore: an evolution of FakeStore
- PG = collection = directory; object = file
- LevelDB holds large xattr spillover and object omap (key/value) data
- On-disk layout:
    /var/lib/ceph/osd/ceph-123/
      current/
        meta/
          osdmap123
          osdmap124
        0.1_head/
          object1
          object12
        0.7_head/
          object3
          object5
        0.a_head/
          object4
          object6
        omap/
          <leveldb files>

POSIX FAILS: TRANSACTIONS
- Most transactions are simple: write some bytes to an object (file), update an object attribute (file xattr), append to an update log (kv insert)... but others are arbitrarily large/complex
- Serialize and write-ahead the transaction to a journal for atomicity
- We double-write everything!
- Lots of ugly hackery to make replayed events idempotent

  Example transaction:
  [
    {
      "op_name": "write",
      "collection": "0.6_head",
      "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
      "length": 4194304,
      "offset": 0,
      "bufferlist length": 4194304
    },
    {
      "op_name": "setattrs",
      "collection": "0.6_head",
      "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
      "attr_lens": { "_": 269, "snapset": 31 }
    },
    {
      "op_name": "omap_setkeys",
      "collection": "0.6_head",
      "oid": "#0:60000000::::head#",
      "attr_lens": { "0000000005.00000000000000000006": 178, "_info": 847 }
    }
  ]
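
To see why the journal double-writes every byte, here is an illustrative stand-in (SimpleTxn and the in-memory journal/filesystem are hypothetical, not FileStore code); the transaction is made durable in the journal first, then applied again to the files:

  // Illustrative sketch of FileStore-style write-ahead journaling.
  // Everything here is a stand-in; the point is that every byte is written
  // twice: once into the journal, once into the backing file system.
  #include <algorithm>
  #include <cstdint>
  #include <map>
  #include <string>
  #include <vector>

  struct SimpleTxn {
    std::string object;
    uint64_t offset = 0;
    std::vector<uint8_t> data;   // the payload travels inside the transaction
  };

  static std::vector<uint8_t> journal;                            // stand-in journal device
  static std::map<std::string, std::vector<uint8_t>> filesystem;  // stand-in files

  void submit(const SimpleTxn& t) {
    // 1) Write-ahead: serialize the whole transaction, data included, into the
    //    journal and (conceptually) fsync it. Only then is the write durable.
    journal.insert(journal.end(), t.data.begin(), t.data.end());

    // 2) Apply: write the same bytes again into the backing file system.
    //    After a crash the journal is replayed, so this step must be idempotent.
    auto& file = filesystem[t.object];
    if (file.size() < t.offset + t.data.size())
      file.resize(t.offset + t.data.size());
    std::copy(t.data.begin(), t.data.end(), file.begin() + t.offset);
  }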

POSIX FAILS: KERNEL CACHE
- The kernel cache is mostly wonderful: automatically sized to all available memory, automatically shrinks when memory is needed elsewhere, efficient, and no cache has to be implemented in the app, yay!
- But read/modify/write vs write-ahead: a write must be persisted to the journal and then applied to the file system, and only then can you read it

POSIX FAILS: ENUMERATION
- Ceph objects are distributed by a 32-bit hash
- Enumeration is in hash order: scrubbing, data rebalancing, recovery, and enumeration via the librados client API
- POSIX readdir is not well-ordered
- Need O(1) split for a given shard/range
- Workaround: build a directory tree by hash-value prefix, split big directories, merge small ones, read an entire directory and sort in memory
    DIR_A/
      A03224D3_qwer
      A247233E_zxcv
    DIR_B/
      DIR_8/
        B823032D_foo
        B8474342_bar
      DIR_9/
        B924273B_baz
      DIR_A/
        BA4328D2_asdf
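
The read-and-sort workaround can be sketched as follows; hash_of and list_in_hash_order are hypothetical helpers, and the filename format is the 8-hex-digit hash prefix shown above:

  // Sketch of the workaround described above: because readdir() returns names
  // in no useful order, a whole hash-prefix directory is read and sorted in
  // memory to produce hash-ordered enumeration. Simplified, not FileStore code.
  #include <algorithm>
  #include <cstdint>
  #include <string>
  #include <vector>

  // Object file names are prefixed with the 8-hex-digit hash, e.g. "A03224D3_qwer".
  uint32_t hash_of(const std::string& filename) {
    return static_cast<uint32_t>(std::stoul(filename.substr(0, 8), nullptr, 16));
  }

  // Enumerate one directory's objects in hash order, starting at `cursor`.
  std::vector<std::string> list_in_hash_order(std::vector<std::string> dir_entries,
                                              uint32_t cursor) {
    std::sort(dir_entries.begin(), dir_entries.end(),
              [](const std::string& a, const std::string& b) {
                return hash_of(a) < hash_of(b);
              });
    std::vector<std::string> out;
    for (const auto& name : dir_entries)
      if (hash_of(name) >= cursor)   // resume from the enumeration cursor
        out.push_back(name);
    return out;
  }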

THE HEADACHES CONTINUE
New FileStore+XFS problems continue to surface:
- FileStore directory splits lead to throughput collapse when an entire pool's PG directories split in unison
- Cannot bound deferred writeback work, even with fsync(2)
- QoS efforts thwarted by deep queues and periodicity in FileStore throughput
- {RBD, CephFS} snapshots trigger inefficient 4MB object copies to create object clones

ACTUALLY, OUR FIRST PLAN WAS BETTER
[Timeline, 2004-2016: EBOFS (later removed) and FakeStore, then FileStore + BtrFS and FileStore + XFS, and finally NewStore and BlueStore]

BLUESTORE

BLUESTORE
- BlueStore = Block + NewStore: consumes raw block device(s), uses a key/value database (RocksDB) for metadata, writes data directly to the block device, supports pluggable inline compression and full data checksums
- Target hardware: current-generation HDDs and SSDs (SATA, SAS, NVMe)
- We must share the block device with RocksDB
[Diagram: ObjectStore -> BlueStore; data goes straight to the BlockDevice, metadata goes through RocksDB -> BlueRocksEnv -> BlueFS -> BlockDevice]

ROCKSDB: BLUEROCKSENV + BLUEFS
- class BlueRocksEnv : public rocksdb::EnvWrapper passes file operations to BlueFS
- BlueFS is a super-simple file system: all metadata lives in the journal, all metadata is loaded into RAM on start/mount, and the journal is rewritten/compacted when it gets large
- Map directories to different block devices: db.wal/ on NVRAM, NVMe, or SSD; db/ (level0 and hot SSTs) on SSD; db.slow/ (cold SSTs) on HDD
- BlueStore periodically balances free space, so there is no need to store a block free list
[Diagram: BlueFS on-disk layout - superblock, journal segments ("file 10", "file 11", "file 12", "rm file 12", "file 13", ...), and data extents]

METADATA

ONODE (OBJECT)
- Per-object metadata
- Lives directly in a key/value pair; serializes to 100s of bytes
- Contains: size in bytes, attributes (user attr data), inline extent map (maybe)

  struct bluestore_onode_t {
    uint64_t size;
    map<string,bufferptr> attrs;
    uint64_t flags;
    // extent map metadata
    struct shard_info {
      uint32_t offset;
      uint32_t bytes;
    };
    vector<shard_info> shards;
    bufferlist inline_extents;
    bufferlist spanning_blobs;
  };

CNODE (COLLECTION)
- Collection metadata: an interval of the object namespace

  Collection key (pool, hash, bits):
    C<12,3d3e0000> "12.e3d3" = <19>
  Object keys (pool, hash, name, snap):
    O<12,3d3d880e,foo,NOSNAP> =
    O<12,3d3d9223,bar,NOSNAP> =
    O<12,3d3e02c2,baz,NOSNAP> =
    O<12,3d3e125d,zip,NOSNAP> =
    O<12,3d3e1d41,dee,NOSNAP> =
    O<12,3d3e3832,dah,NOSNAP> =

  struct spg_t {
    uint64_t pool;
    uint32_t hash;
  };
  struct bluestore_cnode_t {
    uint32_t bits;
  };

- Nice properties: ordered enumeration of objects, and we can split collections by adjusting collection metadata only
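
A simplified illustration of the "interval of the object namespace" idea, assuming membership is decided by the leading bits of the object hash; the real BlueStore key encoding and hash handling differ, but the split-by-metadata-only property follows the same logic:

  // Simplified illustration: a collection claims every object whose hash shares
  // its top `bits` bits. Splitting a collection then only means bumping `bits`
  // and writing new cnode keys - no object data moves.
  #include <cstdint>
  #include <cstdio>

  struct Cnode {
    uint64_t pool;
    uint32_t hash_prefix;  // e.g. 0x3d3e0000 for collection "12.e3d3"
    uint32_t bits;         // how many leading bits of the prefix are significant
  };

  bool collection_contains(const Cnode& c, uint64_t pool, uint32_t object_hash) {
    if (pool != c.pool) return false;
    uint32_t mask = c.bits ? ~((1u << (32 - c.bits)) - 1) : 0;
    return (object_hash & mask) == (c.hash_prefix & mask);
  }

  int main() {
    Cnode c{12, 0x3d3e0000, 19};
    // With this rule, hash 0x3d3e02c2 falls inside the interval and
    // hash 0x3d3d880e falls outside it.
    std::printf("0x3d3e02c2: %d\n", collection_contains(c, 12, 0x3d3e02c2));
    std::printf("0x3d3d880e: %d\n", collection_contains(c, 12, 0x3d3d880e));
  }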

BLOB AND SHAREDBLOB
- Blob: extent(s) on device plus checksum metadata; data may be compressed

  struct bluestore_blob_t {
    uint32_t flags = 0;
    vector<bluestore_pextent_t> extents;
    uint16_t unused = 0; // bitmap
    uint8_t csum_type = CSUM_NONE;
    uint8_t csum_chunk_order = 0;
    bufferptr csum_data;
    uint32_t compressed_length_orig = 0;
    uint32_t compressed_length = 0;
  };

- SharedBlob: extent ref count on cloned blobs

  struct bluestore_shared_blob_t {
    uint64_t sbid;
    bluestore_extent_ref_map_t ref_map;
  };
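
A rough sketch of how csum_chunk_order is used: the blob is divided into 2^order-byte chunks with one checksum stored per chunk, so a small read only verifies the chunks it touches. The checksum function below is a stand-in, not BlueStore's crc32c:

  // Rough sketch of per-chunk blob checksums. The chunk size is
  // 1 << csum_chunk_order bytes and one checksum value is stored per chunk.
  // std::hash is used as a stand-in; BlueStore itself uses crc32c (and
  // optionally weaker 16- or 8-bit variants).
  #include <cstdint>
  #include <functional>
  #include <string>
  #include <vector>

  std::vector<uint64_t> compute_chunk_csums(const std::string& blob_data,
                                            unsigned csum_chunk_order) {
    const size_t chunk = size_t(1) << csum_chunk_order;
    std::vector<uint64_t> csums;
    for (size_t off = 0; off < blob_data.size(); off += chunk)
      csums.push_back(std::hash<std::string>{}(blob_data.substr(off, chunk)));
    return csums;
  }

  // Verify only the chunks overlapped by a read of [offset, offset + length).
  bool verify_read(const std::string& blob_data, const std::vector<uint64_t>& csums,
                   unsigned csum_chunk_order, size_t offset, size_t length) {
    const size_t chunk = size_t(1) << csum_chunk_order;
    for (size_t i = offset / chunk; i * chunk < offset + length && i < csums.size(); ++i) {
      uint64_t c = std::hash<std::string>{}(blob_data.substr(i * chunk, chunk));
      if (c != csums[i]) return false;  // corruption detected in this chunk
    }
    return true;
  }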

EXTENT MAP
- Maps object extents to blob extents
- Serialized in chunks: stored inline in the onode value if small, otherwise stored in adjacent keys

    ...
    O<,,foo,,>  = onode + inline extent map
    O<,,bar,,>  = onode + spanning blobs
    O<,,bar,,0> = extent map shard
    O<,,bar,,4> = extent map shard
    O<,,baz,,>  = onode + inline extent map
    ...

- Blobs are stored inline in each shard unless referenced across shard boundaries; spanning blobs are stored in the onode key
[Diagram: logical offsets mapped onto blobs across extent map shard #1 and shard #2, with one spanning blob crossing the shard boundary]
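
A minimal, hypothetical extent map keyed by logical offset shows what the lookup looks like; the real BlueStore ExtentMap adds sharding and spanning blobs on top of this idea:

  // Minimal sketch of an extent map: logical object offsets -> (blob, blob offset).
  // Keyed by logical start offset so the extent covering any offset can be found
  // with one upper_bound().
  #include <cstdint>
  #include <map>
  #include <optional>

  struct Extent {
    uint64_t length;
    uint64_t blob_id;      // which blob holds the data
    uint64_t blob_offset;  // offset within that blob
  };

  using ExtentMap = std::map<uint64_t, Extent>;  // key = logical start offset

  // Find the extent containing `logical_offset`, if any.
  std::optional<Extent> lookup(const ExtentMap& em, uint64_t logical_offset) {
    auto it = em.upper_bound(logical_offset);  // first extent starting after offset
    if (it == em.begin()) return std::nullopt;
    --it;                                      // extent starting at or before offset
    if (logical_offset < it->first + it->second.length) return it->second;
    return std::nullopt;                       // offset falls in a hole
  }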

DATA PATH

DATA PATH BASICS
Terms:
- TransContext: state describing an executing transaction
- Sequencer: an independent, totally ordered queue of transactions; one per PG

Three ways to write (see the sketch below):
- New allocation: any write larger than min_alloc_size goes to a new, unused extent on disk; once that IO completes, we commit the transaction
- Unused part of an existing blob
- Deferred writes: commit a temporary promise to (over)write data with the transaction (includes the data!), do the async (over)write, then clean up the temporary k/v pair
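
The write-strategy choice can be sketched as below; WriteCtx and choose_write_strategy are hypothetical names, and the real BlueStore logic has more cases and thresholds:

  // Simplified sketch of the "three ways to write" decision.
  #include <cstdint>

  enum class WriteStrategy { NewAllocation, ReuseUnusedBlobSpace, Deferred };

  struct WriteCtx {
    uint64_t length;           // bytes being written
    bool fits_in_unused_blob;  // does an existing blob have unused space here?
  };

  WriteStrategy choose_write_strategy(const WriteCtx& w, uint64_t min_alloc_size) {
    if (w.length >= min_alloc_size)
      // Big writes go to a fresh extent; commit the txn after the data IO lands.
      return WriteStrategy::NewAllocation;
    if (w.fits_in_unused_blob)
      // Small write into space already allocated but never written.
      return WriteStrategy::ReuseUnusedBlobSpace;
    // Small overwrite: stash the data in the k/v transaction now (the "promise"),
    // apply it to the block device asynchronously, then delete the temporary key.
    return WriteStrategy::Deferred;
  }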

IN-MEMORY CACHE
- OnodeSpace per collection: an in-memory name -> Onode map of decoded onodes
- BufferSpace for in-memory Blobs: all in-flight writes, and possibly cached on-disk data
- Both buffers and onodes have lifecycles linked to a Cache; TwoQCache implements the 2Q cache replacement algorithm (default)
- The cache is sharded for parallelism (see the sketch below); the collection-to-shard mapping matches the OSD's op_wq, so the same CPU context that processes client requests touches the LRU/2Q lists
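
The sharding idea can be sketched with a hypothetical ShardedCache; the real TwoQCache/OnodeSpace classes differ, but the per-shard lock and the collection-to-shard mapping are the point:

  // Sketch of cache sharding: each collection (PG) maps to one cache shard, and
  // each shard has its own lock and its own LRU/2Q lists, so different CPU
  // contexts rarely contend. Hypothetical structure, not the BlueStore classes.
  #include <cstdint>
  #include <list>
  #include <mutex>
  #include <vector>

  struct CacheShard {
    std::mutex lock;          // one lock per shard, not one global lock
    std::list<uint64_t> lru;  // stand-in for the LRU/2Q lists
  };

  class ShardedCache {
   public:
    explicit ShardedCache(size_t num_shards) : shards_(num_shards) {}

    // Pick the shard for a collection; using the same mapping as the op worker
    // queue means the thread handling a PG's requests always hits the same shard.
    CacheShard& shard_for(uint64_t collection_hash) {
      return shards_[collection_hash % shards_.size()];
    }

    void touch(uint64_t collection_hash, uint64_t onode_id) {
      CacheShard& s = shard_for(collection_hash);
      std::lock_guard<std::mutex> g(s.lock);
      s.lru.remove(onode_id);        // move-to-front on access
      s.lru.push_front(onode_id);
    }

   private:
    std::vector<CacheShard> shards_;
  };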

PERFORMANCE

HDD: RANDOM WRITE
[Charts: "Bluestore vs Filestore HDD Random Write Throughput" (MB/s) and "Bluestore vs Filestore HDD Random Write IOPS" vs IO size from 4 to 4096 KB, comparing Filestore, Bluestore (wip-bitmap-alloc-perf), BS (Master-a07452d), BS (Master-d62a4948), and Bluestore (wip-bluestore-dw)]

HDD: SEQUENTIAL WRITE
[Charts: "Bluestore vs Filestore HDD Sequential Write Throughput" (MB/s) and "Bluestore vs Filestore HDD Sequential Write IOPS" vs IO size from 4 to 4096 KB, same five configurations as above; the IOPS panel carries a "WTH" annotation]

RGW ON HDD+NVME, EC 4+2
[Chart: "4+2 Erasure Coding RadosGW Write Tests", 32MB objects, 24 HDD/NVMe OSDs on 4 servers, 4 clients; throughput (MB/s) for 1, 4, 128, and 512 buckets with 1 and 4 RGW servers plus a rados bench baseline, comparing Filestore 512KB chunks, Filestore 4MB chunks, Bluestore 512KB chunks, and Bluestore 4MB chunks]

ERASURE CODE OVERWRITES
- Luminous allows overwrites of EC objects
- Requires a two-phase commit to avoid RAID-hole-like failure conditions: the OSD creates temporary rollback objects, clone_range's $extent to a temporary object, then overwrites $extent with the new data (see the sketch below)
- BlueStore can do this easily; FileStore literally copies (reads and writes the to-be-overwritten region)
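
The rollback-object sequence can be sketched against a toy in-memory store; clone_range and ec_overwrite here are hypothetical stand-ins for the OSD's machinery, not its real code:

  // Sketch of the rollback-object dance for EC overwrites. Phase 1 preserves
  // the old bytes so a failed/partial overwrite can be rolled back; phase 2
  // overwrites in place; the temp object is dropped on commit.
  #include <algorithm>
  #include <cstdint>
  #include <map>
  #include <string>
  #include <vector>

  using Store = std::map<std::string, std::vector<uint8_t>>;

  void clone_range(Store& s, const std::string& src, const std::string& dst,
                   size_t off, size_t len) {
    const auto& from = s[src];                      // assumes [off, off+len) exists
    s[dst].assign(from.begin() + off, from.begin() + off + len);
  }

  void ec_overwrite(Store& s, const std::string& obj, size_t off,
                    const std::vector<uint8_t>& new_data) {
    const std::string rollback = obj + "@rollback";
    // Phase 1: stash the region we are about to clobber.
    clone_range(s, obj, rollback, off, new_data.size());
    // Phase 2: overwrite in place. If a peer fails mid-flight, the stashed
    // extent lets the PG roll this shard back to a consistent version.
    std::copy(new_data.begin(), new_data.end(), s[obj].begin() + off);
    // Commit: the rollback object is no longer needed.
    s.erase(rollback);
  }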

BLUESTORE vs FILESTORE: 3X vs EC 4+2 vs EC 5+1
[Chart: "RBD 4K Random Writes", 16 HDD OSDs, 8 32GB volumes, 256 IOs in flight; IOPS for Bluestore (HDD/HDD) vs Filestore under 3x replication, EC 4+2, and EC 5+1]

OTHER CHALLENGES

USERSPACE CACHE
- Built mempool accounting infrastructure: easily annotate/tag C++ classes and containers, low overhead, and a debug mode that provides per-type (vs per-pool) accounting (see the sketch below)
- Requires configuration: bluestore_cache_size (default 1GB) is not as convenient as auto-sized kernel caches
- Finally have a meaningful implementation of fadvise NOREUSE (vs DONTNEED)
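
A sketch of the mempool idea, assuming a hypothetical PoolAllocator: a tagging allocator charges every allocation to a per-pool counter so annotated containers report their memory by pool (Ceph's actual mempool infrastructure is more elaborate):

  // Sketch of mempool-style accounting: a tagging allocator bumps a per-pool
  // byte counter on every allocation and decrements it on deallocation.
  #include <atomic>
  #include <cstddef>
  #include <new>
  #include <vector>

  enum Pool { CACHE_DATA, CACHE_ONODE, NUM_POOLS };
  static std::atomic<long long> pool_bytes[NUM_POOLS];

  template <typename T, Pool P>
  struct PoolAllocator {
    using value_type = T;
    template <typename U> struct rebind { using other = PoolAllocator<U, P>; };

    PoolAllocator() = default;
    template <typename U> PoolAllocator(const PoolAllocator<U, P>&) {}

    T* allocate(std::size_t n) {
      pool_bytes[P] += static_cast<long long>(n * sizeof(T));
      return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t n) {
      pool_bytes[P] -= static_cast<long long>(n * sizeof(T));
      ::operator delete(p);
    }
  };
  template <typename T, typename U, Pool P>
  bool operator==(const PoolAllocator<T, P>&, const PoolAllocator<U, P>&) { return true; }
  template <typename T, typename U, Pool P>
  bool operator!=(const PoolAllocator<T, P>&, const PoolAllocator<U, P>&) { return false; }

  // An annotated container: every element it allocates is charged to CACHE_ONODE.
  using OnodeIds = std::vector<uint64_t, PoolAllocator<uint64_t, CACHE_ONODE>>;

  long long bytes_in(Pool p) { return pool_bytes[p].load(); }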

MEMORY EFFICIENCY
- Careful attention to struct sizes: packing, removing redundant fields
- Checksum chunk sizes: client hints to expect sequential read/write allow large csum chunks; can optionally select a weaker checksum (16 or 8 bits per chunk)
- In-memory red/black trees (e.g., std::map<>) are bad for CPUs: low temporal write locality, many CPU cache misses, failed prefetches; use btrees where appropriate, plus per-onode slab allocators for extent and blob structs

ROCKSDB
- Compaction: overall impact grows as the total metadata corpus grows; invalidates the RocksDB block cache (needed for range queries); awkward to control priority on the kernel libaio interface
- Many deferred write keys end up in L0
- High write amplification: SSDs with low-cost random reads care more about total write overhead
- Open to alternatives for SSD/NVM

STATUS

STATUS
- Stable and the recommended default in Luminous v12.2.z (just released!)
- Migration from FileStore: reprovision individual OSDs and let the cluster heal/recover
- Current efforts: optimizing for CPU time, SPDK support, adding some repair functionality to fsck

FUTURE

FUTURE WORK
- Tiering: hints to an underlying block-based tiering device (e.g., dm-cache, bcache)? Implement directly in BlueStore?
- Host-managed SMR? Maybe...
- Persistent memory + 3D NAND future?
- NVMe extensions for key/value, blob storage?

SUMMARY
- Ceph is great at scaling out, but POSIX was a poor choice for storing objects
- Our new BlueStore backend is so much better: good (and rational) performance, inline compression, and full data checksums
- Designing internal storage interfaces without a POSIX bias is a win... eventually
- We are definitely not done yet: performance, and other stuff
- We can finally solve our IO problems ourselves

BASEMENT CLUSTER
- 2 TB 2.5" HDDs, 1 TB 2.5" SSDs (SATA), 400 GB SSDs (NVMe)
- Luminous 12.2.0: CephFS, cache tiering, erasure coding, BlueStore
- Untrained IT staff!

THANK YOU!
Sage Weil, Ceph Principal Architect
sage@redhat.com
@liewegas