Understanding Write Behaviors of Storage Backends in Ceph Object Store


Understanding Write Behaviors of Storage Backends in Ceph Object Store. Dong-Yun Lee, Kisik Jeong, Sang-Hoon Han, Jin-Soo Kim, Joo-Young Hwang and Sangyeun Cho

How Ceph Amplifies Writes. The client simply asks Ceph to store its data ("Store this, please"), and Ceph answers that it can provide a block device service. Behind that single request, however, Ceph writes extra information describing the data so that updates are atomic and consistent, passes everything through a local file system (XFS), and replicates the data for reliability. A single client write can therefore generate several hidden I/Os — and the question is what happens in a long-term situation.

How Much Does Ceph Amplify Writes? [Chart: breakdown of the write amplification factor (WAF, scale 0–16) for FileStore; on top of the client data, the stacked components (3, 0.702, 5.994, 0.419, 3.635) add up to more than 13, i.e., writes are amplified by over 13x.]

Our Motivation & Goal. How and why are writes so highly amplified in Ceph — with exact numbers? Why focus on write amplification? The WAF (Write Amplification Factor) affects overall performance; when using SSDs it hurts their lifespan; a larger WAF means a smaller effective bandwidth; and redundant journaling of journal may exist. Goals of this paper: understanding the write behaviors of Ceph, through an empirical study of write amplification in Ceph.
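
For reference, the write amplification factor discussed throughout the talk can be written as a simple ratio (the symbols below are ours, not the authors'):

```latex
\[
\mathrm{WAF} \;=\; \frac{\text{bytes written to the storage devices}}{\text{bytes written by the client}},
\qquad
\text{effective client bandwidth} \;\approx\; \frac{\text{aggregate device write bandwidth}}{\mathrm{WAF}} .
\]
```

This is why a larger WAF translates directly into smaller effective bandwidth and, on SSDs, a shorter device lifespan.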

Outline: Introduction, Background, Evaluation environment, Result & Analysis, Conclusion.

Architecture of RADOS Block Device (RBD). [Diagram: a client talks to a MONitor daemon (MON) and Object Storage Daemons (OSD #0 through #3) over the public network.] In the I/O stack, the client writes to /dev/rbd, the Ceph RADOS Block Device, via librbd and librados. The block device image is striped into 4 MB objects, and for each object the CRUSH algorithm determines which OSD it should be sent to.
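
A minimal sketch of the mapping just described, assuming the default 4 MiB object size; the hash-based placement below is only a stand-in for CRUSH, which is far more elaborate (hierarchy- and failure-domain-aware), and the object naming scheme is illustrative.

```python
import hashlib

OBJECT_SIZE = 4 * 1024 * 1024  # RBD default: 4 MiB objects
NUM_OSDS = 4                   # OSD #0..#3 as in the diagram

def block_to_object(image_prefix: str, byte_offset: int):
    """Map a block-device offset to (object name, offset inside that object)."""
    obj_no = byte_offset // OBJECT_SIZE
    return f"{image_prefix}.{obj_no:016x}", byte_offset % OBJECT_SIZE

def place_object(obj_name: str) -> int:
    """Stand-in for CRUSH: deterministically pick an OSD from the object name.
    The real CRUSH algorithm maps object -> placement group -> ordered set of
    OSDs while respecting the cluster hierarchy and failure domains."""
    h = int.from_bytes(hashlib.md5(obj_name.encode()).digest()[:4], "little")
    return h % NUM_OSDS

# A 4 KiB write at offset 10 MiB lands in the third object of the image.
obj, off = block_to_object("rbd_data.1234", 10 * 1024 * 1024)
print(obj, off, "-> OSD", place_object(obj))
```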

Architecture of RADOS Block Device (RBD), inside an OSD. [Diagram courtesy of Sébastien Han, https://www.sebastien-han.fr] Each object is replicated 3x across OSDs (e.g., OSD #0, OSD #4 and OSD #10). Within an OSD, the internal layers hand each write to a storage backend (object store) as a transaction carrying the object data, its attributes and key-value metadata; the available backends are FileStore, KStore and BlueStore, running over XFS or raw storage.
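
To make the "transaction" handed to a backend concrete, here is an illustrative Python stand-in; Ceph's real ObjectStore interface is C++ and much richer, so the names and fields below are assumptions for exposition only.

```python
from dataclasses import dataclass, field

@dataclass
class Transaction:
    """Illustrative stand-in for an object-store transaction: one atomic unit
    combining object writes, extended attributes, and key-value metadata."""
    writes: list = field(default_factory=list)   # (object, offset, bytes)
    attrs: dict = field(default_factory=dict)    # extended attributes
    omap: dict = field(default_factory=dict)     # key-value metadata

    def write(self, obj, offset, data):
        self.writes.append((obj, offset, data))

    def setattr(self, name, value):
        self.attrs[name] = value

    def omap_set(self, key, value):
        self.omap[key] = value

# Each backend (FileStore, KStore, BlueStore) must apply such a transaction
# atomically; how it does so determines how much extra I/O is generated.
```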

Storage Backends: (1) FileStore. FileStore manages each object as a file and relies on write-ahead journaling; object metadata and attributes go to LevelDB and the XFS file system. Write flow in FileStore (see the sketch below): 1. Write-ahead journaling, for consistency and performance — the journal is placed on an NVMe SSD. 2. The actual write to the object file is performed after journaling and can be absorbed by the page cache. 3. syncfs is called and journal entries are flushed every 5 seconds. [Breakdown of FileStore: journal traffic on the NVMe SSD; data, LevelDB DB/WAL and file-system (FS) traffic on the HDDs or SSDs holding the XFS file system.]
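
A condensed sketch of that flow, reusing the Transaction stand-in above; apart from the journal-then-apply ordering and the 5-second sync interval (both taken from the slide), the paths and helpers are illustrative, and the LevelDB metadata updates are omitted.

```python
import os, pickle, time

SYNC_INTERVAL = 5  # seconds, as on the slide: syncfs + flush journal entries

def filestore_submit(journal_fd, txn):
    # 1. Write-ahead journaling: the whole transaction (data + metadata) is
    #    appended and fsync'ed first, which is why journal traffic alone can
    #    exceed the original data traffic for small writes.
    os.write(journal_fd, pickle.dumps(txn))
    os.fsync(journal_fd)

    # 2. Apply the transaction to the object's file on the XFS data partition.
    #    These writes go through the page cache, so they can be absorbed and
    #    coalesced until the next sync.
    for obj, offset, data in txn.writes:
        fd = os.open(f"objects/{obj}", os.O_CREAT | os.O_RDWR, 0o644)
        os.pwrite(fd, data, offset)
        os.close(fd)

def filestore_sync_loop():
    # 3. Every SYNC_INTERVAL seconds, persist the applied state and allow the
    #    corresponding journal entries to be discarded.
    while True:
        time.sleep(SYNC_INTERVAL)
        os.sync()  # stands in for syncfs() on the data partition
        # ...trim journal entries that are now durable...
```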

Storage Backends: (2) KStore. KStore uses an existing key-value store and encapsulates everything — objects, metadata and attributes — into key-value pairs; LevelDB, RocksDB and the Kinetic Store are supported. Write flow in KStore (sketched below): 1. Simply call the key-value APIs with the key-value pairs. [Breakdown of KStore: LevelDB or RocksDB on an XFS file system; write categories are data, compaction, DB, WAL and FS, all on HDDs or SSDs.]
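
A minimal sketch of the same transaction flattened into key-value puts; the key naming is made up, and a plain dict stands in for LevelDB/RocksDB, whose own WAL, memtable flushes and compactions are where the extra write traffic comes from.

```python
def kstore_submit(kv_store: dict, txn):
    """Encapsulate an entire transaction as key-value puts (sketch only; a
    real KStore issues these through the LevelDB/RocksDB batch-write API)."""
    batch = {}
    for obj, offset, data in txn.writes:
        batch[f"data/{obj}/{offset:016x}"] = data        # object data as values
    for name, value in txn.attrs.items():
        batch[f"attr/{name}"] = value                    # attributes
    for key, value in txn.omap.items():
        batch[f"omap/{key}"] = value                     # key-value metadata
    kv_store.update(batch)
    # Inside the KV store, each put is first appended to its WAL, later
    # flushed from the memtable to an SSTable, and rewritten again during
    # compaction -- the source of the sudden WAF jumps seen later.
```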

Storage Backends: (3) BlueStore. BlueStore was designed to avoid the limitations of FileStore — in particular the double-write issue caused by journaling — by storing data directly on the raw device, while object metadata and attributes go to RocksDB. Write flow in BlueStore (sketched below): 1. Put the data onto the raw block device. 2. Set the metadata in RocksDB, which runs on BlueFS, a user-level file system. [Breakdown of BlueStore: data and zero-filled data on the raw device (HDDs or SSDs); the RocksDB DB and RocksDB WAL each live on BlueFS backed by an NVMe SSD.]
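
A simplified sketch of the per-write decision, based only on what the slides state (64 KB minimum extent, WAL-deferred small overwrites, metadata in RocksDB on BlueFS); the threshold logic and data structures are assumptions, not BlueStore's actual code.

```python
MIN_EXTENT = 64 * 1024   # minimum extent size from the slide: 64 KB

def bluestore_submit(raw_dev: bytearray, rocksdb: dict, wal: list,
                     obj_base: int, offset_in_obj: int, data: bytes,
                     is_overwrite: bool):
    """Sketch of BlueStore's write decision (not the real implementation)."""
    if is_overwrite and len(data) < MIN_EXTENT:
        # Small overwrite: logged in the RocksDB WAL first (deferred write),
        # applied to the raw device later -- durable without a second write
        # to the data device on the critical path.
        wal.append((obj_base + offset_in_obj, data))
    else:
        # New allocation: round the write up to a 64 KB extent and zero-fill
        # the uncovered part of the extent -- the "zero-filled data" traffic.
        ext_start = (offset_in_obj // MIN_EXTENT) * MIN_EXTENT
        extent = bytearray(MIN_EXTENT)
        pos = offset_in_obj - ext_start
        extent[pos:pos + len(data)] = data
        raw_dev[obj_base + ext_start:obj_base + ext_start + MIN_EXTENT] = extent
    # Object metadata always goes to RocksDB, which itself runs on BlueFS,
    # a minimal user-level file system over an NVMe SSD.
    rocksdb[f"onode/{obj_base}"] = {"len": len(data), "deferred": is_overwrite}
```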

Outline: Introduction, Background, Evaluation environment, Result & Analysis, Conclusion.

H/W Evaluation Environment.
Admin Server / Client (x1): DELL R730XD, Intel Xeon CPU E5-2640 v3, 128 GB memory.
OSD Servers (x4): DELL R730, Intel Xeon CPU E5-2640 v3, 32 GB memory; storage per server: HGST UCTSSC600 600 GB x4 (HDD), Samsung PM1633 960 GB x4 (SAS SSD), Intel 750 series 400 GB x2 (NVMe SSD).
Switches (x2): public network — DELL N4032, 10 Gbps Ethernet; storage network — Mellanox SX6012, 40 Gbps InfiniBand.

S/W Evaluation Environment. Linux kernel 4.4.43, with some modifications to collect and classify write requests; Ceph Jewel LTS (v10.2.5). From the client side: a 64 GiB KRBD image is configured; there are 4 OSDs per OSD server (16 OSDs in total); workloads are generated with fio while block traces and diskstats are collected. Two different workloads are used: a microbenchmark and a long-term workload.
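
The WAF numbers that follow are derived from block-level write statistics; a minimal sketch of how the headline ratio can be reproduced from /proc/diskstats is shown below. The device names and the authors' per-category classification (done in their modified kernel) are not covered here.

```python
def sectors_written(device: str) -> int:
    """Read the sectors-written counter for a block device from /proc/diskstats."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:      # e.g. "sdb", "nvme0n1"
                return int(fields[9])    # 7th field after the name: sectors written (512 B units)
    raise ValueError(f"device {device} not found")

def waf(osd_devices, before, after, client_bytes):
    """WAF = bytes written to all OSD devices / bytes written by the client."""
    device_bytes = sum((after[d] - before[d]) * 512 for d in osd_devices)
    return device_bytes / client_bytes

# Usage: snapshot sectors_written() for every OSD data/journal device before
# and after an fio run, then divide by the amount fio reports it wrote.
```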

Outline: Introduction, Background, Evaluation environment, Result & Analysis, Conclusion.

Key Findings in the Microbenchmark (the convergence arithmetic is sketched below).
FileStore: large WAF when the write request size is small (over 40 for 4 KB writes); WAF converges to 6 (3x by replication, 2x by journaling).
KStore: sudden WAF jumps due to memtable flushes; WAF converges to 3 (by replication).
BlueStore: zero-filled data is written for the hole within an object because the minimum extent size is set to 64 KB; WAL (write-ahead logging) is used for small overwrites to guarantee durability; WAF converges to 3 (by replication).
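
The convergence values above follow directly from the multiplicative factors, with R the replication factor and J the journaling factor:

```latex
\[
\mathrm{WAF}_{\mathrm{FileStore}} \;\to\; R \times J \;=\; 3 \times 2 \;=\; 6,
\qquad
\mathrm{WAF}_{\mathrm{KStore}},\ \mathrm{WAF}_{\mathrm{BlueStore}} \;\to\; R \;=\; 3
\quad\text{(for large writes).}
\]
```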

Long-Term Workload Methodology. We focus on 4 KB random writes: in VDI, most writes are random and 4 KB in size. Workload scenario: 1. Install Ceph and create a KRBD partition. 2. Drop the page cache, call sync and wait for 600 seconds. 3. Issue 4 KB random writes with QD=128 until the total write amount reaches 90% of the capacity (57.6 GiB). The tests are run with 16 HDDs first and repeated with 16 SSDs, and WAF is calculated and broken down for each time period. (A sketch of such a run follows.)
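
A hedged sketch of how such a run could be driven: the fio options below (random 4 KB writes, iodepth 128, direct I/O against a KRBD device) match the scenario described, but the device path and the authors' exact fio job file are assumptions.

```python
import subprocess

RBD_DEV = "/dev/rbd0"            # assumed KRBD device path
TARGET = int(64 * 2**30 * 0.9)   # 90% of the 64 GiB image = 57.6 GiB

def run_long_term_workload():
    # Step 2 of the scenario: sync and drop the page cache
    # (the 600 s idle wait is omitted here).
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3")

    # Step 3: 4 KB random writes with QD=128 until 57.6 GiB have been written.
    subprocess.run([
        "fio", "--name=longterm", f"--filename={RBD_DEV}",
        "--rw=randwrite", "--bs=4k", "--iodepth=128",
        "--ioengine=libaio", "--direct=1",
        f"--io_size={TARGET}",
    ], check=True)
```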

Long-term Results: (1) FileStore. [Charts: write amount (GB per 15 s on HDDs, GB per 3 s on SSDs) and IOPS over time, broken down into data and journal traffic.] On HDDs, the journal first absorbs the random writes, then performance throttling kicks in because of the slow HDDs, producing large fluctuations due to repeated throttling. On SSDs there is no throttling: SAS SSDs with XFS are fast enough.

Long-term Results: (2) KStore (RocksDB). [Charts: write amount and IOPS over time on HDDs and SSDs, broken down into data and compaction traffic.] There is huge compaction traffic in both cases, and the fluctuation gets worse over time.

Long-term Results: (3) BlueStore. [Charts: write amount and IOPS over time on HDDs and SSDs, broken down into data, RocksDB, RocksDB WAL, compaction and zero-filled data traffic.] Zero-filled data traffic is huge at first. Since extents are allocated in sequential order, the first incoming writes become sequential writes. On SSDs, the superior random write performance shows.

Overall Results of WAF. [Chart: WAF breakdown for FileStore, KStore (RocksDB) and BlueStore, on HDDs and on SSDs; categories include data, journal, compaction and zero-filled data.]
FileStore: the WAF attributable to journaling is about 6 in both cases, not 3 — journaling triples the write traffic — and the two file-system components add another 4.2 (HDD) and 4.054 (SSD).
KStore (RocksDB): the huge compaction traffic is what matters.
BlueStore: even without the zero-filled data, its WAF is still larger than FileStore's; the compaction traffic for RocksDB is not ignorable.

Conclusion. Writes are amplified by more than 13x in Ceph, no matter which storage backend is used. FileStore: external journaling triples the write traffic; its overhead exceeds the original data traffic. KStore: suffers from huge compaction overhead. BlueStore: small write requests are logged in the RocksDB WAL, and the zero-filled data and compaction traffic are not ignorable. Thank you!