Ceph Snapshots: Diving into Deep Waters. Greg Farnum, Red Hat. Vault 2017.03.23

Hi, I'm Greg. Greg Farnum, Principal Software Engineer, Red Hat. gfarnum@redhat.com

Outline
- RADOS, RBD, CephFS: (lightning) overview and how writes happen
- The (self-managed) snapshots interface
- A diversion into pool snapshots
- Snapshots in RBD, CephFS
- RADOS/OSD snapshot implementation, pain points

Ceph's Past & Present
Then: UC Santa Cruz Storage Systems Research Center. A long-term research project in petabyte-scale storage, trying to develop a Lustre successor.
Now: Red Hat, a commercial open-source software & support provider you might have heard of :) (Mirantis, SuSE, Canonical, 42on, Hastexo, ...). Building a business; customers in virtual block devices and object storage... and reaching for filesystem users!

Ceph Projects
OBJECT: RGW, S3- and Swift-compatible object storage with object versioning, multi-site federation, and replication
BLOCK: RBD, a virtual block device with snapshots, copy-on-write clones, and multi-site replication
FILE: CEPHFS, a distributed POSIX file system with coherent caches and snapshots on any directory
LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
RADOS: a software-based, reliable, autonomic, distributed object store comprised of self-healing, self-managing, intelligent storage nodes (OSDs) and lightweight monitors (Mons)

RADOS: Overview

RADOS Components
OSDs:
- 10s to 10,000s in a cluster
- One per disk (or one per SSD, RAID group, ...)
- Serve stored objects to clients
- Intelligently peer for replication & recovery
Monitors:
- Maintain cluster membership and state
- Provide consensus for distributed decision-making
- Small, odd number
- These do not serve stored objects to clients

Object Storage Daemons (diagram: each OSD sits on a local filesystem on a disk; monitors run alongside the OSDs)

CRUSH: Dynamic Data Placement
CRUSH: pseudo-random placement algorithm
- Fast calculation, no lookup
- Repeatable, deterministic
- Statistically uniform distribution
- Stable mapping: limited data migration on change
- Rule-based configuration: infrastructure topology aware, adjustable replication, weighting

DATA IS ORGANIZED INTO POOLS (diagram: objects map into pools A, B, C, D; pools contain PGs; PGs are distributed across the cluster)

librados: RADOS Access for Apps
LIBRADOS:
- Direct access to RADOS for applications: C, C++, Python, PHP, Java, Erlang
- Direct access to storage nodes, no HTTP overhead
- Rich object API: bytes, attributes, key/value data; partial overwrite of existing data; single-object compound atomic operations; RADOS classes (stored procedures)

RADOS: The Write Path (user)

aio_write(const object_t &oid, AioCompletionImpl *c,
          const bufferlist& bl, size_t len, uint64_t off);
c->wait_for_safe();

write(const std::string& oid, bufferlist& bl, size_t len, uint64_t off)
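
For orientation, here is a minimal sketch of what this write path looks like from an application using the public librados C++ API; the pool and object names are made up for illustration:

#include <rados/librados.hpp>

int main() {
  librados::Rados cluster;
  cluster.init("admin");                   // connect as client.admin
  cluster.conf_read_file(NULL);            // read the default ceph.conf
  cluster.connect();

  librados::IoCtx io;
  cluster.ioctx_create("rbd", io);         // any existing pool name

  librados::bufferlist bl;
  bl.append("hello snapshots");
  io.write_full("greeting", bl);           // synchronous write of one object

  // Asynchronous variant: returns once the write is durable everywhere.
  librados::AioCompletion *c = librados::Rados::aio_create_completion();
  io.aio_write_full("greeting", c, bl);
  c->wait_for_complete();
  c->release();

  io.close();
  cluster.shutdown();
}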

RADOS: The Write Path (Network) (diagram: the client sends the write to the primary OSD, which forwards it to the replica OSDs)

RADOS: The Write Path (OSD)
- Queue write for PG
- Lock PG
- Assign order to write op
- Package it for persistent storage: find current object state, etc.
- Send op to replicas
- Send to local persistent storage
- Unlock PG
- Wait for commits from persistent storage and replicas
- Send commit back to client

RBD: Overview

STORING VIRTUAL DISKS (diagram: a VM's hypervisor uses librbd to store the virtual disk in the RADOS cluster)

RBD STORES VIRTUAL DISKS
RADOS BLOCK DEVICE:
- Storage of disk images in RADOS
- Decouples VMs from host
- Images are striped across the cluster (pool)
- Snapshots
- Copy-on-write clones
- Support in: mainline Linux kernel (2.6.39+); Qemu/KVM, native Xen coming soon; OpenStack, CloudStack, Nebula, Proxmox

RBD: The Write Path

ssize_t Image::write(uint64_t ofs, size_t len, bufferlist& bl)
int Image::aio_write(uint64_t off, size_t len, bufferlist& bl, RBD::AioCompletion *c)
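
A minimal sketch of this path from an application's point of view, using the public librbd C++ API; cluster setup is as in the earlier librados example, and the pool and image names are made up:

#include <rados/librados.hpp>
#include <rbd/librbd.hpp>
#include <string>

int main() {
  librados::Rados cluster;
  cluster.init("admin");
  cluster.conf_read_file(NULL);
  cluster.connect();

  librados::IoCtx io;
  cluster.ioctx_create("rbd", io);

  librbd::RBD rbd;
  librbd::Image image;
  rbd.open(io, image, "vm-disk-1");        // open an existing image by name

  ceph::bufferlist bl;
  bl.append(std::string(4096, 'a'));
  image.write(0, bl.length(), bl);         // synchronous write at offset 0

  image.close();
  io.close();
  cluster.shutdown();
}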

CephFS: Overview

CephFS clients (diagram: a Linux host accesses CephFS via the kernel module, or via ceph-fuse, Samba, or Ganesha; metadata and data paths both lead into the RADOS cluster)

CephFS: The Write Path (User)

extern "C" int ceph_write(struct ceph_mount_info *cmount, int fd,
                          const char *buf, int64_t size, int64_t offset)
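
A minimal sketch of calling this entry point through libcephfs; the mount point and file path are illustrative:

#include <cephfs/libcephfs.h>
#include <fcntl.h>
#include <string.h>

int main() {
  struct ceph_mount_info *cmount;
  ceph_create(&cmount, "admin");           // client.admin
  ceph_conf_read_file(cmount, NULL);       // default ceph.conf search path
  ceph_mount(cmount, "/");                 // mount the filesystem root

  int fd = ceph_open(cmount, "/backups/history", O_CREAT | O_WRONLY, 0644);
  const char *buf = "hello cephfs";
  ceph_write(cmount, fd, buf, strlen(buf), 0);
  ceph_close(cmount, fd);

  ceph_unmount(cmount);
  ceph_release(cmount);
}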

CephFS: The Write Path (Network) (diagram: the client talks to the MDS for capabilities and to the OSDs for data)

CephFS: The Write Path
- Request write capability from MDS if not already present
- Get cap from MDS
- Write new data to ObjectCacher
- (Inline, or later when flushing) Send write to OSD
- Receive commit from OSD
- Return to caller

The Origin of Snapshots

[john@schist backups]$ touch history
[john@schist backups]$ cd .snap
[john@schist .snap]$ mkdir snap1
[john@schist .snap]$ cd ..
[john@schist backups]$ rm -f history
[john@schist backups]$ ls
[john@schist backups]$ ls .snap/snap1
history
# Deleted file still there in the snapshot!

Snapshot Design: Goals & Limits
- For CephFS
- Arbitrary subtrees: lots of seemingly-unrelated objects snapshotting together
- Must be cheap to create
- We have external storage for any desired snapshot metadata

Snapshot Design: Outcome
- Snapshots are per-object
- Driven on object write, so snaps which logically apply to an object don't touch it if it's not written
- Very skinny data: per-object list of existing snaps, global list of deleted snaps
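
The per-object bookkeeping revolves around the SnapContext that clients attach to their writes. As a rough, simplified illustration (not the full Ceph definition):

#include <cstdint>
#include <vector>

using snapid_t = uint64_t;        // stand-in for Ceph's snapid type

// Simplified view of the SnapContext a client attaches to each write.
struct SnapContextSketch {
  snapid_t seq;                   // newest snapshot id the client knows about
  std::vector<snapid_t> snaps;    // all existing snap ids, newest first
};

On the OSD, this seq is compared with the object's on-disk SnapSet (shown a few slides later); if the context is newer, the object is cloned before the write is applied.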

RADOS: Self-managed snapshots

Librados snaps interface

int set_snap_write_context(snapid_t seq, vector<snapid_t>& snaps);
int selfmanaged_snap_create(uint64_t *snapid);
void aio_selfmanaged_snap_create(uint64_t *snapid, AioCompletionImpl *c);
int selfmanaged_snap_remove(uint64_t snapid);
void aio_selfmanaged_snap_remove(uint64_t snapid, AioCompletionImpl *c);
int selfmanaged_snap_rollback_object(const object_t& oid, ::SnapContext& snapc, uint64_t snapid);

Allocating Self-managed Snapshots
snapids are allocated by incrementing the snapid and snap_seq members of the per-pool pg_pool_t OSDMap struct

Allocating Self-managed Snapshots (diagram: the client requests a snapid from the monitor leader, which commits the allocation to disk with its peons before replying)

Allocating Self-managed Snapshots (same diagram) ...or just make them up yourself (CephFS does so in the MDS)

Librados snaps interface

int set_snap_write_context(snapid_t seq, vector<snapid_t>& snaps);
int selfmanaged_snap_create(uint64_t *snapid);
void aio_selfmanaged_snap_create(uint64_t *snapid, AioCompletionImpl *c);
int selfmanaged_snap_remove(uint64_t snapid);
void aio_selfmanaged_snap_remove(uint64_t snapid, AioCompletionImpl *c);
int selfmanaged_snap_rollback_object(const object_t& oid, ::SnapContext& snapc, uint64_t snapid);

Writing With Snapshots

write(const std::string& oid, bufferlist& bl, size_t len, uint64_t off)

(diagram: the client sends the write to the primary OSD, which forwards it to the replica)
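
Putting the interface together, here is a hedged sketch of how a librados client might use self-managed snapshots around its writes, using the public C++ IoCtx methods (which correspond to the internal calls shown above); the pool, object, and data are illustrative:

#include <rados/librados.hpp>
#include <vector>

// Assumes `io` is an open librados::IoCtx, as in the earlier example.
void snapshot_then_overwrite(librados::IoCtx& io) {
  librados::bufferlist v1;
  v1.append("version 1");
  io.write_full("myobject", v1);

  // Allocate a new self-managed snapid from the monitors.
  uint64_t snapid;
  io.selfmanaged_snap_create(&snapid);

  // Tell this IoCtx which SnapContext to attach to future writes:
  // seq = newest snap, snaps = all existing snaps, newest first.
  std::vector<uint64_t> snaps = {snapid};
  io.selfmanaged_snap_set_write_ctx(snapid, snaps);

  // This write carries the SnapContext; the OSD clones the object
  // before applying it, preserving "version 1" under the snapshot.
  librados::bufferlist v2;
  v2.append("version 2");
  io.write_full("myobject", v2);
}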

Snapshots: The OSD Path
- Queue write for PG
- Lock PG
- Assign order to write op
- Package it for persistent storage: find current object state, etc.
- make_writeable()
- Send op to replicas
- Send to local persistent storage
- Wait for commits from persistent storage and replicas
- Send commit back to client

Snapshots: The OSD Path
The PrimaryLogPG::make_writeable() function:
- Is the SnapContext newer than what the object already has on disk?
- (Create a transaction to) clone the existing object
- Update the stats and clone range overlap information
PG::append_log() calls update_snap_map():
- Updates the SnapMapper, which maintains LevelDB entries mapping snapid -> object and object -> snapid
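
As a rough sketch of the decision make_writeable() makes (heavily simplified pseudocode, not the real PrimaryLogPG implementation; the types and the clone bookkeeping here are stand-ins):

#include <cstdint>
#include <vector>

// Stand-in types, for illustration only.
using snapid_t = uint64_t;
struct SnapContext { snapid_t seq; std::vector<snapid_t> snaps; };
struct SnapSet     { snapid_t seq = 0; std::vector<snapid_t> clones; };
struct ObjectState { bool exists = true; SnapSet snapset; };

// Simplified sketch of the clone-on-first-write decision.
void make_writeable_sketch(ObjectState& obj, const SnapContext& snapc) {
  if (obj.exists && snapc.seq > obj.snapset.seq) {
    // At least one snapshot was created since this object was last written:
    // clone the current HEAD so the old contents stay readable via the snap.
    // (The real code builds a transaction that copies the object and updates
    // clone size / overlap stats; here we just record the clone.)
    obj.snapset.clones.push_back(snapc.seq);
  }
  obj.snapset.seq = snapc.seq;   // remember the newest snap we have seen
  // ...the incoming write then proceeds against HEAD as usual.
}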

Snapshots: OSD Data Structures

struct SnapSet {
  snapid_t seq;
  bool head_exists;
  vector<snapid_t> snaps;    // descending
  vector<snapid_t> clones;   // ascending
  map<snapid_t, interval_set<uint64_t> > clone_overlap;
  map<snapid_t, uint64_t> clone_size;
}

This is attached to the HEAD object in an xattr

RADOS: Pool Snapshots :(

Pool Snaps: Desire
- Make snapshots easy for admins
- Leverage the existing per-object implementation
- Overlay the correct SnapContext automatically on writes
- Spread that SnapContext via the OSDMap

Librados pool snaps interface

int snap_list(vector<uint64_t> *snaps);
int snap_lookup(const char *name, uint64_t *snapid);
int snap_get_name(uint64_t snapid, std::string *s);
int snap_get_stamp(uint64_t snapid, time_t *t);
int snap_create(const char* snapname);
int snap_remove(const char* snapname);
int rollback(const object_t& oid, const char *snapname);

Note how that's still per-object!

Pool Snaps: Reality
Spread that SnapContext via the OSDMap:
- It's not a point-in-time snapshot: the SnapContext spreads virally as OSDMaps get pushed out
- No guaranteed temporal order between two different RBD volumes in the pool, even when attached to the same VM :(
- Inflates the OSDMap size with a per-pool map:

map<snapid_t, pool_snap_info_t> snaps;
struct pool_snap_info_t {
  snapid_t snapid;
  utime_t stamp;
  string name;
}

They are unlikely to solve a problem you want

Pool Snaps: Reality
Overlay the correct SnapContext automatically on writes:
- No sensible way to merge that with a self-managed SnapContext
- ...so we don't: pick one or the other for a pool
All in all, pool snapshots are unlikely to usefully solve any problems.

RBD: Snapshot Structures

RBD Snapshots: Data Structures

struct cls_rbd_snap {
  snapid_t id;
  string name;
  uint64_t image_size;
  uint64_t features;
  uint8_t protection_status;
  cls_rbd_parent parent;
  uint64_t flags;
  utime_t timestamp;
  cls::rbd::SnapshotNamespaceOnDisk snapshot_namespace;
}

RBD Snapshots: Data Structures
- A cls_rbd_snap for every snapshot
- Stored in the omap (read: LevelDB) key-value space on the RBD volume's header object
- The RBD object class exposes a get_snapcontext() function, called on mount
- RBD clients watch on the header and get a notify when a new snap is created, to update themselves
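
From the application side, all of that machinery is hidden behind a couple of librbd calls. A minimal sketch using the public C++ API (image handle as in the earlier librbd example; the snapshot name is made up):

#include <rbd/librbd.hpp>
#include <cstdio>
#include <vector>

// Assumes `image` is an open librbd::Image, as in the earlier example.
void snapshot_and_list(librbd::Image& image) {
  // Creates a cls_rbd_snap entry in the header object's omap and
  // notifies watching clients.
  image.snap_create("before-upgrade");

  // List the snapshots recorded on the header object.
  std::vector<librbd::snap_info_t> snaps;
  image.snap_list(snaps);
  for (const auto& s : snaps)
    printf("snap %llu: %s (%llu bytes)\n",
           (unsigned long long)s.id, s.name.c_str(),
           (unsigned long long)s.size);
}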

CephFS: Snapshot Structures

CephFS Snapshots: Goals & Limits
- For CephFS
- Arbitrary subtrees: lots of seemingly-unrelated objects snapshotting together
- Must be cheap to create
- We have external storage for any desired snapshot metadata

CephFS Snapshots: Memory
Directory CInodes have SnapRealms. Important elements:

snapid_t seq;                   // a version/seq # for changes to _this_ realm
snapid_t created;               // when this realm was created
snapid_t last_created;          // last snap created in _this_ realm
snapid_t last_destroyed;        // seq for last removal
snapid_t current_parent_since;
map<snapid_t, SnapInfo> snaps;
map<snapid_t, snaplink_t> past_parents;  // key is "last" (or NOSNAP)

CephFS Snapshots: SnapRealms (four diagram slides: a directory tree /home with children greg, sage, ric, illustrating how SnapRealms attach to directories in the hierarchy)

CephFS Snapshots: SnapRealms
Directory CInodes have SnapRealms. Important elements:

snapid_t seq;                   // a version/seq # for changes to _this_ realm
snapid_t created;               // when this realm was created
snapid_t last_created;          // last snap created in _this_ realm
snapid_t last_destroyed;        // seq for last removal
snapid_t current_parent_since;
map<snapid_t, SnapInfo> snaps;
map<snapid_t, snaplink_t> past_parents;  // key is "last" (or NOSNAP)

Construct the SnapContext!
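
Conceptually, building the SnapContext for a write means walking the file's SnapRealm and every realm it has ever belonged to, collecting their snapids. A much-simplified sketch (stand-in types; the real MDS logic lives in SnapRealm and handles many more cases):

#include <algorithm>
#include <cstdint>
#include <set>
#include <vector>

using snapid_t = uint64_t;

// Stand-in for the realm metadata shown above; illustration only.
struct RealmSketch {
  snapid_t seq = 0;
  std::set<snapid_t> snaps;                 // snaps created in this realm
  RealmSketch* parent = nullptr;            // current parent realm
  std::vector<RealmSketch*> past_parents;   // realms this subtree used to live under
};

// Gather every snapid that applies to files in `realm`, plus the highest
// seq seen; together they form the SnapContext used for writes.
void build_snap_context(const RealmSketch* realm,
                        std::set<snapid_t>& snaps_out, snapid_t& seq_out) {
  if (!realm) return;
  seq_out = std::max(seq_out, realm->seq);
  snaps_out.insert(realm->snaps.begin(), realm->snaps.end());
  build_snap_context(realm->parent, snaps_out, seq_out);   // current ancestors
  for (const RealmSketch* p : realm->past_parents)         // ...and past ones
    build_snap_context(p, snaps_out, seq_out);
}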

CephFS Snapshots: Memory
All CInodes have an old_inode_t map representing their past states for snapshots:

struct old_inode_t {
  snapid_t first;
  inode_t inode;
  std::map<string,bufferptr> xattrs;
}

CephFS Snapshots: Disk
- SnapRealms are encoded as part of the inode
- Snapshotted metadata stored as the old_inode_t map in memory/on disk
- Snapshotted data stored in RADOS object self-managed snapshots

Example directory object for /<ino 0,v2>/home<ino 1,v5>/greg<ino 5,v9>/ (mydir[01], total size 7 MB):
  foo -> ino 1342, 4 MB, [<1>,<3>,<10>]
  bar -> ino 1001, 1024 KBytes
  baz -> ino 1242, 2 MB
Data object 1342.0/HEAD for /<v2>/home<v5>/greg<v9>/foo

CephFS Snapshots
- Arbitrary sub-tree snapshots of the hierarchy
- Metadata stored as the old_inode_t map in memory/on disk
- Data stored in RADOS object snapshots

  1342.0/1    -> /<v1>/home<v3>/greg<v7>/foo  (snapshotted clone)
  1342.0/HEAD -> /<v2>/home<v5>/greg<v9>/foo  (current data)

CephFS: Snapshot Pain

CephFS Pain: Opening past parents
Directory CInodes have SnapRealms. Important elements:

snapid_t seq;                   // a version/seq # for changes to _this_ realm
snapid_t created;               // when this realm was created
snapid_t last_created;          // last snap created in _this_ realm
snapid_t last_destroyed;        // seq for last removal
snapid_t current_parent_since;
map<snapid_t, SnapInfo> snaps;
map<snapid_t, snaplink_t> past_parents;

CephFS Pain: Opening past parents
- To construct the SnapContext for a write, we need all the SnapRealms it has ever participated in
- Because it could have been logically snapshotted in an old location but not written to since, and a new write must reflect that old location's snapid
- So we must open all the directories the file has been a member of!
- With a single MDS, this isn't too hard
- With multi-MDS, this can be very difficult in some scenarios
  - We may not know who is authoritative for a directory under all failure and recovery scenarios
  - If there's been a disaster, metadata may be inaccessible, but we don't have mechanisms for holding operations and retrying when unrelated metadata is inaccessible

CephFS Pain: Opening past parents
Directory CInodes have SnapRealms. Important elements:

snapid_t seq;                   // a version/seq # for changes to _this_ realm
snapid_t created;               // when this realm was created
snapid_t last_created;          // last snap created in _this_ realm
snapid_t last_destroyed;        // seq for last removal
snapid_t current_parent_since;
map<snapid_t, SnapInfo> snaps;
map<snapid_t, snaplink_t> past_parents;

Why not store snaps in all descendants instead of maintaining ancestor links?

CephFS Pain: Eliminating past parents
- The MDS opens an inode for any operation performed on it; this includes its SnapRealm
- So we can merge snapid lists down whenever we open an inode that has a new SnapRealm
- So if we rename a directory/file into a new location, its SnapRealm already contains all the right snapids, and then we don't need a link to the past!
- I got this almost completely finished: reduced code line count, much simpler snapshot tracking code
- But...

CephFS Pain: Hard links
- Hard links and snapshots do not interact :( They should!
- That means we need to merge SnapRealms from all the linked parents of an inode
- And this is the exact same problem we have with past_parents
- Since we need to open remote inodes correctly, avoiding it in the common case doesn't help us
- So, back to debugging and more edge cases

RADOS: Deleting Snapshots

Librados snaps interface

int set_snap_write_context(snapid_t seq, vector<snapid_t>& snaps);
int selfmanaged_snap_create(uint64_t *snapid);
void aio_selfmanaged_snap_create(uint64_t *snapid, AioCompletionImpl *c);
int selfmanaged_snap_remove(uint64_t snapid);
void aio_selfmanaged_snap_remove(uint64_t snapid, AioCompletionImpl *c);
int selfmanaged_snap_rollback_object(const object_t& oid, ::SnapContext& snapc, uint64_t snapid);

Deleting Snapshots (Client) (diagram: the client asks the monitors to remove the snapid; the monitors commit the change to disk before replying)

Deleting Snapshots (Monitor)
Generate a new OSDMap updating the pg_pool_t:

interval_set<snapid_t> removed_snaps;

- This is really space-efficient if you consistently delete your oldest snapshots!
- Rather less so if you keep every other one forever... and this looks sort of like some sensible RBD snapshot strategies (daily for a week, weekly for a month, monthly for a year)
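
As a hedged illustration of why the interval representation degrades (made-up snapids, and a toy stand-in rather than the real interval_set type): deleting the oldest snapshots keeps the set at one interval, while deleting every other one leaves one interval per deleted snap.

#include <cstdint>
#include <cstdio>
#include <map>

// Toy stand-in for interval_set<snapid_t>: maps interval start -> length.
using IntervalSet = std::map<uint64_t, uint64_t>;

static void insert_snap(IntervalSet& s, uint64_t snap) {
  // Merge with the preceding interval if contiguous, else start a new one.
  // (The real interval_set also merges with the following interval.)
  auto it = s.lower_bound(snap);
  if (it != s.begin() && (--it)->first + it->second == snap)
    it->second++;
  else
    s[snap] = 1;
}

int main() {
  IntervalSet oldest_first, every_other;
  for (uint64_t snap = 1; snap <= 100; ++snap)
    insert_snap(oldest_first, snap);        // delete snaps 1..100 in order
  for (uint64_t snap = 1; snap <= 100; snap += 2)
    insert_snap(every_other, snap);         // delete snaps 1,3,5,...,99
  printf("oldest-first: %zu interval(s)\n", oldest_first.size());   // 1
  printf("every-other:  %zu interval(s)\n", every_other.size());    // 50
}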

Deleting Snapshots (OSD)
OSD advances its OSDMap, then asynchronously:
- List objects with that snapshot via the SnapMapper:

int get_next_objects_to_trim(snapid_t snap, unsigned max, vector<hobject_t> *out);

- For each object:
  - Unlink the object clone for that snapshot (1 coalescable IO). Sometimes clones belong to multiple snaps, so we might not delete right away
  - Update the object HEAD's SnapSet xattr (1+ unique IO)
  - Remove the SnapMapper's LevelDB entries for that object/snap pair (1 coalescable IO)
  - Write down a PGLog entry recording the clone removal (1 coalescable IO)
- Note that if you trim a bunch of snaps, you do this for each one; there is no coalescing it down to one pass on each object :(
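
A rough sketch of that trimming loop (stand-in types and stubbed helpers; the real logic lives in the OSD's snap trimmer), mainly to make the "one pass per snapshot" cost visible:

#include <cstdint>
#include <string>
#include <vector>

// Stand-in types and stubbed helpers, for illustration only.
using snapid_t  = uint64_t;
using hobject_t = std::string;
int  get_next_objects_to_trim(snapid_t, unsigned, std::vector<hobject_t>*) { return -1; }
void unlink_clone_if_unreferenced(const hobject_t&, snapid_t) {}
void update_head_snapset_xattr(const hobject_t&, snapid_t) {}
void remove_snapmapper_entries(const hobject_t&, snapid_t) {}
void log_clone_removal(const hobject_t&, snapid_t) {}

// Every trimmed snapshot gets its own pass over its objects, even when the
// same object has clones under several of the snaps being trimmed.
void trim_removed_snaps_sketch(const std::vector<snapid_t>& removed_snaps) {
  for (snapid_t snap : removed_snaps) {                 // one pass per snap
    std::vector<hobject_t> objs;
    while (get_next_objects_to_trim(snap, 64, &objs) == 0 && !objs.empty()) {
      for (const hobject_t& obj : objs) {
        unlink_clone_if_unreferenced(obj, snap);  // 1 coalescable IO
        update_head_snapset_xattr(obj, snap);     // 1+ unique IO
        remove_snapmapper_entries(obj, snap);     // 1 coalescable IO
        log_clone_removal(obj, snap);             // 1 coalescable IO
      }
      objs.clear();
    }
  }
}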

Deleting Snapshots (OSD)
- So that's at least 1 IO per object in a snap, potentially a lot more if we needed to fetch KV data off disk, didn't have directories cached, etc. Ouch!
- This will be a lot better in BlueStore! It's just coalescable metadata ops
- Even worse: throttling is hard. The why is a whole talk on its own; it's very difficult to not overwhelm clusters if you do a lot of trimming at once

RADOS: Alternate Approaches

Past: Deleting Snapshots
- Maintain a per-snapid directory with hard links!
- Every clone is linked into (all) its snapid directory(s)
- Just list the directory to identify them, then update the object's SnapSet and unlink from all relevant directories
- Turns out this destroys locality, in addition to being icky code

Present: Why per-object?
- For instance, why not LVM snapshots?
- We don't want to snapshot everything on an OSD at once: no implicit consistency groups across RBD volumes, for instance
- So we ultimately need a snap -> object mapping, since each snap touches so few objects

Future: Available enhancements
- Update internal interfaces for more coalescing
  - There's no architectural reason we need to scan each object per-snapshot
  - Instead, maintain iterators for each snapshot we are still purging and advance them through the keyspace in step, so we can do all snapshots of a particular object in one go
- Change the deleted snapshots representation so it doesn't inflate OSDMaps
  - A deleting_snapshots set instead, which can be trimmed once all OSDs report they've been removed
  - Store the full list of deleted snapshots in config-keys or similar, handle it with ceph-mgr

Future: Available enhancements
- BlueStore: it makes everything better
  - Stop having to map our semantics onto a filesystem
  - Snapshot deletes still require the snapid -> object mapping, but the actual delete is a few key updates rolled into RocksDB, easy to coalesce
- Better throttling for users
  - Short-term: hacks to enable sleeps so we don't overwhelm the local FS
  - Long-term: proper cost estimates for BlueStore that we can budget correctly (not really possible on filesystems, since they don't expose the number of IOs needed to flush the current dirty state)

THANK YOU! Greg Farnum Principal Engineer, Ceph gfarnum@redhat.com @gregsfortytwo