What's New in Jewel for RADOS?
Samuel Just, Vault 2015
QUICK PRIMER ON CEPH AND RADOS
CEPH MOTIVATING PRINCIPLES
- All components must scale horizontally
- There can be no single point of failure
- The solution must be hardware agnostic
- Should use commodity hardware
- Self-manage whenever possible
- Open source (LGPL)
CEPH COMPONENTS (diagram: apps, hosts/VMs, and clients sit on top of these interfaces)
- RGW: a web services gateway for object storage, compatible with S3 and Swift
- RBD: a reliable, fully distributed block device with cloud platform integration
- CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management
- LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
- RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
RADOS
- Strong consistency (CP system)
- Flat object namespace within each pool
- Infrastructure aware, dynamic topology
- Hash-based placement (CRUSH)
- Direct client-to-server data path
LIBRADOS Interface
- Rich object API
- Bytes, attributes, key/value data
- Partial overwrite of existing data
- Single-object compound atomic operations
- RADOS classes (stored procedures)
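The same object model can be poked at from the command line. A minimal sketch with the rados CLI, assuming a pool named mypool already exists (pool, object, and attribute names here are made up; the CLI runs each step as a separate operation, whereas librados can batch several of them into one compound atomic operation on a single object):

    # write byte data to an object (created if it does not exist)
    echo "hello" > /tmp/payload
    rados -p mypool put greeting /tmp/payload

    # attach an attribute (xattr) to the same object
    rados -p mypool setxattr greeting owner sam

    # store key/value (omap) data alongside the object
    rados -p mypool setomapval greeting colour blue

    # read everything back
    rados -p mypool get greeting /tmp/out
    rados -p mypool getxattr greeting owner
    rados -p mypool listomapvals greeting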
RADOS COMPONENTS
OSDs: 10s to 1000s in a cluster
- One per disk (or one per SSD or RAID group)
- Serve stored objects to clients
- Intelligently peer for replication & recovery
OBJECT STORAGE DAEMONS (diagram: each OSD stores its data in a local filesystem such as xfs, btrfs, or ext4 on top of a disk; monitors run alongside the OSDs)
RADOS COMPONENTS
Monitors:
- Maintain cluster membership and state
- Provide consensus for distributed decision-making
- Small, odd number (e.g., 5)
- Not part of data path
DATA PLACEMENT AND PG TEMP PRIMING
CALCULATED PLACEMENT (diagram: the application computes, rather than looks up, which node holds an object; nodes cover namespace ranges A-G, H-N, O-T, U-Z, and the monitors provide the map without sitting in the data path)
CRUSH (diagram: objects are hashed into placement groups (PGs), and CRUSH maps each PG onto a set of OSDs in the cluster)
CRUSH IS A QUICK CALCULATION (diagram: a client computes an object's target OSDs directly from the cluster map, with no lookup table required)
CRUSH AVOIDS FAILED DEVICES (diagram: when an OSD fails, CRUSH deterministically selects a replacement placement for the affected data)
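You can see this calculation from the admin command line with ceph osd map, sketched here for an illustrative pool and object name (the epoch, PG ID, and OSD IDs in the sample output are made up; only the shape of the output matters):

    ceph osd map mypool greeting
    # example output (shape only, IDs will differ):
    # osdmap e1340 pool 'mypool' (1) object 'greeting' -> pg 1.5e1f9a2b (1.2b)
    #   -> up ([3,1,7], p3) acting ([3,1,7], p3)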
Recovery, Backfill, and PG Temp
- The state of a PG on an OSD is represented by a PG Log, which contains the most recent operations witnessed by that OSD.
- The authoritative state of a PG after Peering is represented by constructing an authoritative PG Log from an up-to-date peer.
- If a peer's PG Log overlaps the authoritative PG Log, we can construct a set of out-of-date objects to recover.
- Otherwise, we do not know which objects are out-of-date, and we must perform Backfill.
Recovery, Backfill, and PG Temp
- If we do not know which objects are invalid, we certainly cannot serve reads from such a peer.
- If, after peering, the primary determines that it or any other peer requires Backfill, it requests that the Monitor cluster publish a new map with an exception to the CRUSH mapping for this PG (a PG Temp entry), mapping it to the best set of up-to-date peers it can find.
- Once that map is published, peering happens again, and the up-to-date peers independently conclude that they should serve reads and writes while concurrently backfilling the correct peers.
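These temporary exceptions are recorded in the OSD map itself, so a quick way to spot them is to grep the map dump; a small sketch (the PG ID and OSD list in the sample line are made up, only the shape is indicative):

    ceph osd dump | grep pg_temp
    # example output (shape only):
    # pg_temp 1.2b [0,1,2]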
Backfill Example (without priming)
- Epoch 1130: the PG maps to OSDs [0, 1, 2]
- Epoch 1340: CRUSH remaps the PG to OSDs [3, 4, 5], which have none of its data
- Epoch 1345: the Monitors publish a PG Temp exception mapping the PG back to [0, 1, 2], which serve I/O while [3, 4, 5] are backfilled
- Epoch 1391: backfill is complete, the PG Temp entry is removed, and the PG maps to [3, 4, 5]
Backfill With PG Temp Priming
- Epoch 1130: the PG maps to OSDs [0, 1, 2]
- Epoch 1340: CRUSH remaps the PG to [3, 4, 5], but the Monitors prime a PG Temp entry in the same map, so the acting set stays [0, 1, 2] and the extra map epoch (the 1345 step above) is not needed
- Epoch 1391: backfill is complete, the PG Temp entry is removed, and the PG maps to [3, 4, 5]
PG Temp Priming Caveats
- Puts stress on the Monitors, since the number of PGs to examine grows linearly with the size of the cluster.
- Strategy: the Monitors spend bounded time scanning for these cases and give up when the time limit expires.
- When a new map only involves OSDs being weighted down or marked down, we can get away with only scanning those PGs which currently map to those OSDs.
- When a new map involves OSDs being weighted up or marked up, we need to scan all PGs (we can't predict which ones moved to those OSDs).
- The OSDs are perfectly happy whether the PG Temp mapping is pre-set or not. If not, they simply fall back to the old behavior.
PG Temp Priming Config options
- mon_osd_prime_pg_temp
- mon_osd_prime_pg_temp_max_time
Enabled by default in Jewel. Thanks Sage!
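For reference, on a cluster where the feature is not already on by default, these options would live in the [mon] section of ceph.conf; a minimal sketch, with the time limit value chosen purely for illustration:

    [mon]
    # let the monitors pre-set pg_temp entries when publishing a new map
    mon_osd_prime_pg_temp = true
    # bound the time (seconds) the monitors spend scanning PGs per new map
    mon_osd_prime_pg_temp_max_time = 0.5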
Cache/Tiering and Jewel
CEPH AND CACHING (diagram: the application talks to a cache pool layered in front of a backing pool, both inside the Ceph storage cluster)
RADOS TIERING PRINCIPLES
- Each tier is a RADOS pool; it may be replicated or erasure coded
- Tiers are durable, e.g., replicated across SSDs in multiple hosts
- Each tier has its own CRUSH policy, e.g., the cache pool maps to SSD devices/hosts only
- librados adapts to the tiering topology and transparently directs requests accordingly, e.g., to the cache
- No changes to RBD, RGW, CephFS, etc.
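A minimal sketch of wiring two existing pools together this way (pool names are made up, and the hit set tuning shown is only one of several knobs a real deployment would set):

    # put 'cachepool' in front of 'basepool' as a writeback cache tier
    ceph osd tier add basepool cachepool
    ceph osd tier cache-mode cachepool writeback
    ceph osd tier set-overlay basepool cachepool

    # give the cache pool a hit set so the OSDs can track object temperature
    ceph osd pool set cachepool hit_set_type bloom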
READ (CACHE HIT) (diagram: the client's READ goes to the writeback cache pool (SSD), which holds the object and returns the READ REPLY; the backing pool (HDD) is not touched)
READ (CACHE MISS) (diagram: the client's READ goes to the cache pool, the object is PROMOTEd from the backing pool into the cache, and the READ REPLY is then served from the cache)
READ REDIRECT (CACHE MISS) (diagram: the cache pool REDIRECTs the client, which issues the READ against the backing pool and receives the READ REPLY from there)
READ PROXY (CACHE MISS) (diagram: the cache pool PROXYs the READ to the backing pool on the client's behalf, returns the READ REPLY, and may optionally PROMOTE the object into the cache)
WRITE INTO CACHE POOL (diagram: the client's WRITE lands in the writeback cache pool (SSD) and is ACKed from there; the backing pool (HDD) is updated later)
WRITE (MISS) (diagram: on a write miss, the object is PROMOTEd from the backing pool into the cache pool, the WRITE is applied there, and the client receives the ACK)
WRITE PROXY (MISS) (diagram: on a write miss, the cache pool PROXYs the WRITE directly to the backing pool and ACKs the client without promoting the object)
ERASURE CODING AND FAST READS
ERASURE CODING
REPLICATED POOL (diagram: the object is stored as full copies across the Ceph storage cluster)
- Full copies of stored objects
- Very high durability
- 3x (200% overhead)
- Quicker recovery
ERASURE CODED POOL (diagram: the object is stored as data chunks 1-4 plus coding chunks X and Y)
- One copy plus parity
- Cost-effective durability
- 1.5x (50% overhead)
- Expensive recovery
ERASURE CODING SHARDS (diagram: an object is split into data shards 1-4 plus coding shards X and Y, each stored on a different OSD in the erasure coded pool)
ERASURE CODING SHARDS (diagram: each stripe places data units 0-3, 4-7, 8-11, 12-15, 16-19 across shards 1-4, with the corresponding coding blocks A/A', B/B', C/C', D/D', E/E' on shards X and Y)
- Variable stripe size (e.g., 4 KB)
- Zero-fill shards (logically) in partial tail stripe
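A 4+2 layout like the one pictured could be created roughly as follows; the profile name, pool name, and PG counts are illustrative, and a real deployment would also pick a CRUSH rule and erasure code plugin settings:

    # 4 data chunks + 2 coding chunks, i.e. 1.5x overhead
    ceph osd erasure-code-profile set ec-4-2 k=4 m=2
    # create an erasure coded pool using that profile
    ceph osd pool create ecpool 128 128 erasure ec-4-2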
PRIMARY (diagram: among the shard OSDs 1-4, X, Y, one acts as the PG primary and coordinates client I/O for the erasure coded pool)
EC WRITE (diagram sequence: the client sends its WRITE to the primary; the primary computes the shards and issues sub-WRITES to the other shard OSDs; once they complete, the primary returns the ACK to the client)
EC WRITE: DEGRADED WRITE (diagram: the primary issues the sub-writes even though one shard OSD is unavailable, and the write completes degraded)
EC WRITE: PARTIAL FAILURE (diagram sequence: a write reaches only some of the shard OSDs before failing, leaving a mix of new (B) and old (A) shard versions across the OSDs that must be reconciled)
EC RESTRICTIONS
- Overwrite in place will not work in general; a log plus two-phase commit would increase complexity and latency
- We chose to restrict the allowed operations (see the sketch below):
  - create
  - append (on a stripe boundary)
  - remove (keep the previous generation of the object for some time)
- These operations can all easily be rolled back locally:
  - create -> delete
  - append -> truncate
  - remove -> roll back to the previous generation
- Object attrs are preserved in existing PG log entries (they are small)
- Key/value data is not allowed on EC pools
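As a rough illustration (pool, object, and file names are made up), the allowed operations correspond to rados CLI calls like these, while key/value writes against an EC pool should be rejected:

    # create and append are fine on an erasure coded pool
    rados -p ecpool put logobj /tmp/part1
    rados -p ecpool append logobj /tmp/part2
    # key/value (omap) writes are not supported on EC pools, so this fails
    rados -p ecpool setomapval logobj somekey someval
    # remove keeps the previous generation around internally for a while
    rados -p ecpool rm logobj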
EC READ (diagram sequence: the client sends its READ to the primary; the primary issues sub-READS to enough shard OSDs to reconstruct the data, then returns the READ REPLY)
EC READ LONG TAIL (diagram sequence: if one of the shard OSDs the primary read from is slow, the READ REPLY is held up until that last shard arrives, producing long-tail read latency)
EC FAST READ (diagram sequence: the primary issues sub-READS to all shards and returns the READ REPLY as soon as enough shards have arrived to reconstruct the data, cutting off the long tail)
EC FAST READ
fast_read can be enabled on a pool using:
    ceph osd pool set <pool_name> fast_read true
If mon_osd_pool_ec_fast_read is set to true, the mons will create EC pools with fast_read enabled by default.
Thanks to Guang Yang at Yahoo!
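For the cluster-wide default mentioned above, the monitor option would go in ceph.conf; a minimal sketch:

    [mon]
    # newly created erasure coded pools get fast_read enabled by default
    mon_osd_pool_ec_fast_read = true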
ROADMAP