Understanding Write Behaviors of Storage Backends in Ceph Object Store
Dong-Yun Lee, Kisik Jeong, Sang-Hoon Han, Jin-Soo Kim, Joo-Young Hwang and Sangyeun Cho
How Ceph Amplifies Writes: Ceph provides a block device service to the client. Each client write is journaled to make the update atomic and consistent, passed through a local file system (XFS), and replicated for reliability. A single write can therefore trigger several hidden I/Os. What about the long-term situation? 2-6
How Much Does Ceph Amplify Writes? [Stacked WAF chart for FileStore, WAF axis 0 to 16, with segments of roughly 3, 0.702, 5.994, 0.419 and 3.635] Writes are amplified by over 13x. 7-13
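Assuming the five chart segments above are the stacked WAF components, they sum to 3 + 0.702 + 5.994 + 0.419 + 3.635 = 13.75, consistent with the "over 13x" figure.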
Our Motivation & Goal: How and why are writes highly amplified in Ceph, with exact numbers? Why do we focus on write amplification? WAF (Write Amplification Factor) affects the overall performance; when using SSDs, it hurts their lifespan; a larger WAF means a smaller effective bandwidth; and redundant journaling of journals may exist. Goals of this paper: understanding the write behaviors of Ceph storage backends, through an empirical study of write amplification in Ceph. 14
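For reference, WAF here is the ratio of bytes actually written to the storage devices to bytes written by the client. A minimal sketch of that calculation (the function name and sample numbers are illustrative, not taken from the paper):

    def write_amplification_factor(device_bytes_written, client_bytes_written):
        # WAF = total bytes written at the device level / bytes issued by the client
        return device_bytes_written / client_bytes_written

    # Hypothetical example: a 4 KiB client write that ends up as 55 KiB of device writes
    print(write_amplification_factor(55 * 1024, 4 * 1024))  # -> 13.75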
Outline Introduction Background Evaluation environment Result & Analysis Conclusion 15
Architecture of RADOS Block Device (RBD): on the client, the I/O stack exposes /dev/rbd (the Ceph RADOS Block Device) on top of librbd and librados; the block device is striped into 4MB objects, and for each object the CRUSH algorithm determines which OSD it should be sent to; MONitor daemons (MON) and Object Storage Daemons (OSD #0 to OSD #3) are reached over the public network. 16-19
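To make the striping concrete, here is a minimal sketch of how a block-device offset maps to a 4MB RADOS object. The rbd_data.<image id>.<hex index> naming assumes a format-2 RBD image, and the helper is illustrative, not Ceph code:

    OBJECT_SIZE = 4 * 1024 * 1024  # default RBD object size: 4 MiB

    def rbd_object_for_offset(image_id_hex, byte_offset):
        # Each 4 MiB stripe of the block device maps to one RADOS object.
        obj_index = byte_offset // OBJECT_SIZE
        return "rbd_data.%s.%016x" % (image_id_hex, obj_index)

    # librados then hashes the object name onto a placement group and lets the
    # CRUSH algorithm pick the target OSDs, so no central lookup table is needed.
    print(rbd_object_for_offset("10a6b8b4567", 9 * 1024 * 1024))  # object index 2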
Architecture of RADOS Block Device (RBD), inside the OSDs: each write is replicated 3x across OSDs (e.g., OSD #0, OSD #1, OSD #2); within an OSD, a transaction carrying object data, key-value metadata and attributes is passed down the internal layers to the storage backend (object store), which is one of FileStore, KStore or BlueStore on top of XFS or raw storage. (*Figure courtesy: Sébastien Han, https://www.sebastien-han.fr) 20-21
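The transaction handed to the internal layers can be pictured as a batch of updates that the chosen backend must apply atomically. The sketch below is a simplified stand-in for Ceph's actual object-store transaction interface; the class and method names are illustrative:

    class Transaction:
        # A batch of object updates that a backend (FileStore / KStore / BlueStore)
        # must persist atomically.
        def __init__(self):
            self.ops = []

        def write(self, obj, offset, data):      # object data
            self.ops.append(("write", obj, offset, data))

        def setattr(self, obj, key, value):      # object attributes
            self.ops.append(("setattr", obj, key, value))

        def omap_set(self, obj, key, value):     # key-value metadata
            self.ops.append(("omap_set", obj, key, value))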
Storage Backends: (1) FileStore. Manages each object as a file; object metadata is kept in LevelDB and the journal on a separate device (an NVMe SSD here), with objects stored on an XFS file system over HDDs or SSDs. Write flow in FileStore: 1. Write-ahead journaling, for consistency and performance. 2. Performs the actual write to the file after journaling; this can be absorbed by the page cache. 3. Calls syncfs and flushes journal entries every 5 seconds. [Breakdown of FileStore traffic: data, journal, DB/WAL, and file system overhead] 22
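A minimal sketch of the three-step FileStore write flow above, assuming the journal file sits on a separate fast device; this illustrates the flow only and is not FileStore's actual implementation:

    import os
    import time

    class FileStoreSketch:
        def __init__(self, journal_path, data_dir, sync_interval=5.0):
            self.journal = open(journal_path, "ab")  # write-ahead journal (e.g., on an NVMe SSD)
            self.data_dir = data_dir
            self.sync_interval = sync_interval
            self.last_sync = time.time()

        def write_object(self, name, offset, data):
            # 1. Write-ahead journaling: append and fsync the transaction first.
            self.journal.write(data)
            self.journal.flush()
            os.fsync(self.journal.fileno())
            # 2. Apply the write to the object's backing file; this may be
            #    absorbed by the page cache.
            path = os.path.join(self.data_dir, name)
            mode = "r+b" if os.path.exists(path) else "wb"
            with open(path, mode) as f:
                f.seek(offset)
                f.write(data)
            # 3. Roughly every sync_interval seconds (5 s here), sync the data file
            #    system and trim applied journal entries; approximated by checking
            #    on each incoming write.
            if time.time() - self.last_sync >= self.sync_interval:
                os.sync()  # stands in for syncfs() on the data file system
                self.last_sync = time.time()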
Storage Backends: (2) KStore. Uses existing key-value stores, encapsulating everything (objects, metadata, attributes) into key-value pairs; supports LevelDB, RocksDB and Kinetic. Write flow in KStore: 1. Simply calls the key-value APIs with the key-value pair; the store runs on an XFS file system over HDDs or SSDs. [Breakdown of KStore traffic: data, DB WAL, file system overhead, and compaction] 23
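A minimal sketch of the KStore idea, with a generic put()-style key-value interface standing in for LevelDB or RocksDB; the key layout is illustrative and not Ceph's actual encoding:

    def kstore_write(kv, object_name, offset, data, metadata, attrs):
        # Everything (object data, metadata, attributes) becomes key-value puts.
        kv.put(("data/%s/%d" % (object_name, offset)).encode(), data)
        for k, v in metadata.items():
            kv.put(("meta/%s/%s" % (object_name, k)).encode(), v)
        for k, v in attrs.items():
            kv.put(("attr/%s/%s" % (object_name, k)).encode(), v)
        # Durability and reorganization (WAL, memtable flushes, compaction) are left
        # entirely to the key-value store, which is where the extra traffic comes from.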
Storage Backends: (3) BlueStore. Designed to avoid the limitations of FileStore, in particular the double-write issue due to journaling, by storing data directly on the raw device; metadata and attributes go to RocksDB running on BlueFS, a user-level file system. Write flow in BlueStore: 1. Puts data into the raw block device. 2. Sets the metadata in RocksDB on BlueFS. [Breakdown of BlueStore traffic: data and zero-filled data on the raw device (HDDs or SSDs); RocksDB DB and RocksDB WAL on BlueFS over NVMe SSDs] 24
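A minimal sketch of the BlueStore-style split between the raw device and RocksDB, assuming the 64KB minimum extent size used in this setup; the deferred-write handling is heavily simplified and the key names are illustrative:

    MIN_ALLOC = 64 * 1024  # minimum extent size in this setup (64 KiB)

    def bluestore_write(raw_dev, kv, offset, data, is_small_overwrite):
        if is_small_overwrite:
            # Small overwrite: log the data in the RocksDB WAL first (deferred write)
            # so durability is guaranteed before it is applied to the device later.
            kv.put(b"deferred/%016x" % offset, data)
            return
        # New allocation: pad the extent up to the 64 KiB minimum allocation size;
        # this padding is the source of the zero-filled data traffic.
        pad = (-len(data)) % MIN_ALLOC
        raw_dev.seek(offset)
        raw_dev.write(data + b"\0" * pad)
        # Object metadata (extent map, attributes) goes to RocksDB on BlueFS.
        kv.put(b"meta/%016x" % offset, b"extent map / checksums / attrs")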
Outline Introduction Background Evaluation environment Result & Analysis Conclusion 25
H/W Evaluation Environment
Admin Server / Client (x1): DELL R730XD, Intel Xeon CPU E5-2640 v3, 128 GB memory
OSD Servers (x4): DELL R730, Intel Xeon CPU E5-2640 v3, 32 GB memory; storage: HGST UCTSSC600 600 GB x4, Samsung PM1633 960 GB x4, Intel 750 series 400 GB x2
Switches (x2): public network: DELL N4032, 10Gbps Ethernet; storage network: Mellanox SX6012, 40Gbps InfiniBand 26
S/W Evaluation Environment: Linux 4.4.43 kernel, with some modifications to collect and classify write requests; Ceph Jewel LTS version (v10.2.5). From the client side: configures a 64GiB KRBD, with 4 OSDs per OSD server (16 OSDs in total), and generates workloads using fio while collecting traces and diskstats. Performs 2 different workloads: a microbenchmark and a long-term workload. 27
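One way the per-device write traffic can be collected is from /proc/diskstats; a minimal sketch (the device name is a placeholder), using the standard field layout where the 10th field is the cumulative count of sectors written:

    def bytes_written(device="sdb"):
        # /proc/diskstats: field 3 is the device name, field 10 is sectors written.
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == device:
                    return int(fields[9]) * 512  # sectors are 512 bytes
        raise ValueError("device not found: " + device)

    # Sampling this before and after a run (and across all OSD devices) gives the
    # device-level write amount used to compute WAF.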
Outline Introduction Background Evaluation environment Result & Analysis Conclusion 28
Key Findings in the Microbenchmark. FileStore: large WAF when the write request size is small (over 40 for 4KB writes); WAF converges to 6 (3x by replication, 2x by journaling). KStore: sudden WAF jumps due to memtable flushes; WAF converges to 3 (by replication). BlueStore: zero-filled data is written for the hole within an object because the minimum extent size is set to 64KB; WAL (Write-Ahead Logging) is used for small overwrites to guarantee durability; WAF converges to 3 (by replication). 29
Long-Term Workload Methodology. We focus on 4KB random writes, since in VDI most writes are random and 4KB in size. Workload scenario: 1. Install Ceph and create a krbd partition. 2. Drop the page cache, call sync, and wait for 600 secs. 3. Issue 4KB random writes with QD=128 until the total write amount reaches 90% of the capacity (57.6GiB). Run the tests with 16 HDDs first and repeat with 16 SSDs; calculate and break down WAF over given time periods. 30
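A hypothetical fio invocation approximating step 3 of this scenario (the device path and size are placeholders; the slide does not give the exact fio options used):

    import subprocess

    # 4 KiB random writes at queue depth 128, direct I/O, against the krbd device.
    subprocess.run([
        "fio", "--name=randwrite-longterm",
        "--filename=/dev/rbd0",          # placeholder krbd device path
        "--rw=randwrite", "--bs=4k",
        "--iodepth=128", "--ioengine=libaio", "--direct=1",
        "--io_size=57g",                 # stop after ~57.6 GiB of writes (90% of 64 GiB)
    ], check=True)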
Long-term Results: (1) FileStore. [Plots of write amount per interval and IOPS over time, HDD vs SSD, broken down by traffic class] On HDDs, the journal first absorbs the random writes, then performance throttling kicks in because the HDDs are slow, causing large fluctuation due to repeated throttling. On SSDs there is no throttling: SAS SSDs with XFS are fast enough. 31
Long-term Results: (2) KStore (RocksDB). [Plots of write amount per interval and IOPS over time, HDD vs SSD, broken down into data and compaction traffic] Fluctuation gets worse, and there is huge compaction traffic in both cases. 32
Long-term Results: (3) BlueStore. [Plots of write amount per interval and IOPS over time, HDD vs SSD, broken down into data, RocksDB, RocksDB WAL, compaction and zero-filled data] Zero-filled data traffic is huge at first. On HDDs, since extents are allocated in sequential order, the first incoming writes become sequential writes; on SSDs this matters less, as SSDs have superior random write performance. 33
Overall Results of WAF. [Stacked WAF breakdown, axis 0 to 80, for FileStore, KStore (RocksDB) and BlueStore on HDD and SSD; legend includes data, compaction and zero-filled data] FileStore: WAF for the journal is about 6 in both cases, not 3, so journaling triples the write traffic, and FS metadata + FS journal add 4.2 (HDD) and 4.054 (SSD). KStore: the huge compaction traffic matters. BlueStore: WAF is still larger than FileStore's even without the zero-filled data, as the compaction traffic for RocksDB is not negligible. 34-37
Conclusion. Writes are amplified by more than 13x in Ceph, no matter which storage backend is used. FileStore: external journaling triples the write traffic, and the journaling overhead exceeds the original data traffic. KStore: suffers from huge compaction overhead. BlueStore: small write requests are logged in the RocksDB WAL, and the zero-filled data and compaction traffic are not negligible. Thank you! 38