What's New in Jewel for RADOS?
Samuel Just, Vault 2015
QUICK PRIMER ON CEPH AND RADOS
CEPH MOTIVATING PRINCIPLES
- All components must scale horizontally
- There can be no single point of failure
- The solution must be hardware agnostic
- Should use commodity hardware
- Self-manage whenever possible
- Open source (LGPL)
CEPH COMPONENTS (diagram: apps, hosts/VMs, and clients sit on top of these interfaces)
- RGW: a web services gateway for object storage, compatible with S3 and Swift
- RBD: a reliable, fully distributed block device with cloud platform integration
- CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management
- LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
- RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
RADOS
- Strong consistency (CP system)
- Flat object namespace within each pool
- Infrastructure aware, dynamic topology
- Hash-based placement (CRUSH)
- Direct client-to-server data path
LIBRADOS Interface
- Rich object API
- Bytes, attributes, key/value data
- Partial overwrite of existing data
- Single-object compound atomic operations
- RADOS classes (stored procedures)
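The same object model can be poked at from the command line. A minimal sketch with the rados CLI, assuming a pool named mypool already exists (pool, object, and attribute names here are made up; the CLI runs each step as a separate operation, whereas librados can batch several of them into one compound atomic operation on a single object):

    # write byte data to an object (created if it does not exist)
    echo "hello" > /tmp/payload
    rados -p mypool put greeting /tmp/payload

    # attach an attribute (xattr) to the same object
    rados -p mypool setxattr greeting owner sam

    # store key/value (omap) data alongside the object
    rados -p mypool setomapval greeting colour blue

    # read everything back
    rados -p mypool get greeting /tmp/out
    rados -p mypool getxattr greeting owner
    rados -p mypool listomapvals greeting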
RADOS COMPONENTS
OSDs: 10s to 1000s in a cluster
- One per disk (or one per SSD or RAID group)
- Serve stored objects to clients
- Intelligently peer for replication & recovery
OBJECT STORAGE DAEMONS (diagram: each OSD stores its data in a local filesystem such as xfs, btrfs, or ext4 on top of a disk; monitors run alongside the OSDs)
RADOS COMPONENTS
Monitors:
- Maintain cluster membership and state
- Provide consensus for distributed decision-making
- Small, odd number (e.g., 5)
- Not part of data path
DATA PLACEMENT AND PG TEMP PRIMING
CALCULATED PLACEMENT (diagram: the application computes, rather than looks up, which node holds an object; nodes cover namespace ranges A-G, H-N, O-T, U-Z, and the monitors provide the map without sitting in the data path)
CRUSH (diagram: objects are hashed into placement groups (PGs), and CRUSH maps each PG onto a set of OSDs in the cluster)
CRUSH IS A QUICK CALCULATION (diagram: a client computes an object's target OSDs directly from the cluster map, with no lookup table required)
CRUSH AVOIDS FAILED DEVICES (diagram: when an OSD fails, CRUSH deterministically selects a replacement placement for the affected data)
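You can see this calculation from the admin command line with ceph osd map, sketched here for an illustrative pool and object name (the epoch, PG ID, and OSD IDs in the sample output are made up; only the shape of the output matters):

    ceph osd map mypool greeting
    # example output (shape only, IDs will differ):
    # osdmap e1340 pool 'mypool' (1) object 'greeting' -> pg 1.5e1f9a2b (1.2b)
    #   -> up ([3,1,7], p3) acting ([3,1,7], p3)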
Recovery, Backfill, and PG Temp
- The state of a PG on an OSD is represented by a PG Log, which contains the most recent operations witnessed by that OSD.
- The authoritative state of a PG after Peering is represented by constructing an authoritative PG Log from an up-to-date peer.
- If a peer's PG Log overlaps the authoritative PG Log, we can construct a set of out-of-date objects to recover.
- Otherwise, we do not know which objects are out-of-date, and we must perform Backfill.
Recovery, Backfill, and PG Temp
- If we do not know which objects are invalid, we certainly cannot serve reads from such a peer.
- If, after peering, the primary determines that it or any other peer requires Backfill, it requests that the Monitor cluster publish a new map with an exception to the CRUSH mapping for this PG (a PG Temp entry), mapping it to the best set of up-to-date peers it can find.
- Once that map is published, peering happens again, and the up-to-date peers independently conclude that they should serve reads and writes while concurrently backfilling the correct peers.
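These temporary exceptions are recorded in the OSD map itself, so a quick way to spot them is to grep the map dump; a small sketch (the PG ID and OSD list in the sample line are made up, only the shape is indicative):

    ceph osd dump | grep pg_temp
    # example output (shape only):
    # pg_temp 1.2b [0,1,2]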
Backfill Example (without priming)
- Epoch 1130: the PG maps to OSDs [0, 1, 2]
- Epoch 1340: CRUSH remaps the PG to OSDs [3, 4, 5], which have none of its data
- Epoch 1345: the Monitors publish a PG Temp exception mapping the PG back to [0, 1, 2], which serve I/O while [3, 4, 5] are backfilled
- Epoch 1391: backfill is complete, the PG Temp entry is removed, and the PG maps to [3, 4, 5]
Backfill With PG Temp Priming
- Epoch 1130: the PG maps to OSDs [0, 1, 2]
- Epoch 1340: CRUSH remaps the PG to [3, 4, 5], but the Monitors prime a PG Temp entry in the same map, so the acting set stays [0, 1, 2] and the extra map epoch (the 1345 step above) is not needed
- Epoch 1391: backfill is complete, the PG Temp entry is removed, and the PG maps to [3, 4, 5]
PG Temp Priming Caveats
- Puts stress on the Monitors, since the number of PGs to examine grows linearly with the size of the cluster.
- Strategy: the Monitors spend bounded time scanning for these cases and give up when the time limit expires.
- When a new map only involves OSDs being weighted down or marked down, we can get away with only scanning those PGs which currently map to those OSDs.
- When a new map involves OSDs being weighted up or marked up, we need to scan all PGs (we can't predict which ones moved to those OSDs).
- The OSDs are perfectly happy whether the PG Temp mapping is pre-set or not. If not, they simply fall back to the old behavior.
PG Temp Priming Config options
- mon_osd_prime_pg_temp
- mon_osd_prime_pg_temp_max_time
Enabled by default in Jewel. Thanks Sage!
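For reference, on a cluster where the feature is not already on by default, these options would live in the [mon] section of ceph.conf; a minimal sketch, with the time limit value chosen purely for illustration:

    [mon]
    # let the monitors pre-set pg_temp entries when publishing a new map
    mon_osd_prime_pg_temp = true
    # bound the time (seconds) the monitors spend scanning PGs per new map
    mon_osd_prime_pg_temp_max_time = 0.5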
Cache/Tiering and Jewel
CEPH AND CACHING (diagram: the application talks to a cache pool layered in front of a backing pool, both inside the Ceph storage cluster)
RADOS TIERING PRINCIPLES
- Each tier is a RADOS pool; it may be replicated or erasure coded
- Tiers are durable, e.g., replicated across SSDs in multiple hosts
- Each tier has its own CRUSH policy, e.g., the cache pool maps to SSD devices/hosts only
- librados adapts to the tiering topology and transparently directs requests accordingly, e.g., to the cache
- No changes to RBD, RGW, CephFS, etc.
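A minimal sketch of wiring two existing pools together this way (pool names are made up, and the hit set tuning shown is only one of several knobs a real deployment would set):

    # put 'cachepool' in front of 'basepool' as a writeback cache tier
    ceph osd tier add basepool cachepool
    ceph osd tier cache-mode cachepool writeback
    ceph osd tier set-overlay basepool cachepool

    # give the cache pool a hit set so the OSDs can track object temperature
    ceph osd pool set cachepool hit_set_type bloom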
READ (CACHE HIT) (diagram: the client's READ goes to the writeback cache pool (SSD), which holds the object and returns the READ REPLY; the backing pool (HDD) is not touched)
READ (CACHE MISS) (diagram: the client's READ goes to the cache pool, the object is PROMOTEd from the backing pool into the cache, and the READ REPLY is then served from the cache)
READ REDIRECT (CACHE MISS) (diagram: the cache pool REDIRECTs the client, which issues the READ against the backing pool and receives the READ REPLY from there)
READ PROXY (CACHE MISS) (diagram: the cache pool PROXYs the READ to the backing pool on the client's behalf, returns the READ REPLY, and may optionally PROMOTE the object into the cache)
WRITE INTO CACHE POOL (diagram: the client's WRITE lands in the writeback cache pool (SSD) and is ACKed from there; the backing pool (HDD) is updated later)
WRITE (MISS) (diagram: on a write miss, the object is PROMOTEd from the backing pool into the cache pool, the WRITE is applied there, and the client receives the ACK)
WRITE PROXY (MISS) (diagram: on a write miss, the cache pool PROXYs the WRITE directly to the backing pool and ACKs the client without promoting the object)
ERASURE CODING AND FAST READS
ERASURE CODING
REPLICATED POOL (diagram: the object is stored as full copies across the Ceph storage cluster)
- Full copies of stored objects
- Very high durability
- 3x (200% overhead)
- Quicker recovery
ERASURE CODED POOL (diagram: the object is stored as data chunks 1-4 plus coding chunks X and Y)
- One copy plus parity
- Cost-effective durability
- 1.5x (50% overhead)
- Expensive recovery
ERASURE CODING SHARDS (diagram: an object is split into data shards 1-4 plus coding shards X and Y, each stored on a different OSD in the erasure coded pool)
ERASURE CODING SHARDS (diagram: each stripe places data units 0-3, 4-7, 8-11, 12-15, 16-19 across shards 1-4, with the corresponding coding blocks A/A', B/B', C/C', D/D', E/E' on shards X and Y)
- Variable stripe size (e.g., 4 KB)
- Zero-fill shards (logically) in partial tail stripe
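A 4+2 layout like the one pictured could be created roughly as follows; the profile name, pool name, and PG counts are illustrative, and a real deployment would also pick a CRUSH rule and erasure code plugin settings:

    # 4 data chunks + 2 coding chunks, i.e. 1.5x overhead
    ceph osd erasure-code-profile set ec-4-2 k=4 m=2
    # create an erasure coded pool using that profile
    ceph osd pool create ecpool 128 128 erasure ec-4-2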
PRIMARY (diagram: among the shard OSDs 1-4, X, Y, one acts as the PG primary and coordinates client I/O for the erasure coded pool)
EC WRITE (diagram sequence: the client sends its WRITE to the primary; the primary computes the shards and issues sub-WRITES to the other shard OSDs; once they complete, the primary returns the ACK to the client)
EC WRITE: DEGRADED WRITE (diagram: the primary issues the sub-writes even though one shard OSD is unavailable, and the write completes degraded)
EC WRITE: PARTIAL FAILURE (diagram sequence: a write reaches only some of the shard OSDs before failing, leaving a mix of new (B) and old (A) shard versions across the OSDs that must be reconciled)
EC RESTRICTIONS
- Overwrite in place will not work in general; a log plus two-phase commit would increase complexity and latency
- We chose to restrict the allowed operations (see the sketch below):
  - create
  - append (on a stripe boundary)
  - remove (keep the previous generation of the object for some time)
- These operations can all easily be rolled back locally:
  - create -> delete
  - append -> truncate
  - remove -> roll back to the previous generation
- Object attrs are preserved in existing PG log entries (they are small)
- Key/value data is not allowed on EC pools
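As a rough illustration (pool, object, and file names are made up), the allowed operations correspond to rados CLI calls like these, while key/value writes against an EC pool should be rejected:

    # create and append are fine on an erasure coded pool
    rados -p ecpool put logobj /tmp/part1
    rados -p ecpool append logobj /tmp/part2
    # key/value (omap) writes are not supported on EC pools, so this fails
    rados -p ecpool setomapval logobj somekey someval
    # remove keeps the previous generation around internally for a while
    rados -p ecpool rm logobj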
EC READ (diagram sequence: the client sends its READ to the primary; the primary issues sub-READS to enough shard OSDs to reconstruct the data, then returns the READ REPLY)
EC READ LONG TAIL (diagram sequence: if one of the shard OSDs the primary read from is slow, the READ REPLY is held up until that last shard arrives, producing long-tail read latency)
EC FAST READ (diagram sequence: the primary issues sub-READS to all shards and returns the READ REPLY as soon as enough shards have arrived to reconstruct the data, cutting off the long tail)
EC FAST READ
fast_read can be enabled on a pool using:
    ceph osd pool set <pool_name> fast_read true
If mon_osd_pool_ec_fast_read is set to true, the mons will create EC pools with fast_read enabled by default.
Thanks to Guang Yang at Yahoo!
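For the cluster-wide default mentioned above, the monitor option would go in ceph.conf; a minimal sketch:

    [mon]
    # newly created erasure coded pools get fast_read enabled by default
    mon_osd_pool_ec_fast_read = true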
ROADMAP