Block Storage Service: Status and Performance

Dan van der Ster, IT-DSS, 6 June 2014

Summary

This memo summarizes the current status of the Ceph block storage service as it is used for OpenStack Cinder Volumes and Glance Images as of May 2014. We present the block storage activity on the current cluster, measuring IOPS and latencies, and present a cost/benefit analysis of using SSDs to optimize the cost and performance efficiency of the service. During tests in collaboration with IT-CF, we have concluded that by adding SSDs as the synchronous write journals (used to guarantee data durability), we are able to increase the IOPS capacity by 4 to 5 times, at the cost of decreasing the available volume by 20%. Further, the testing has shown that the Ceph implementation is able to operate at the limit of the hardware performance; software-induced performance limitations were not yet observed in either the spinning-disk or SSD configurations. In addition, we believe that increasing small-write performance with SSDs is applicable only to the block storage use case; high-bandwidth use cases such as physics data storage should not require SSDs.

Configuration

The current production block storage cluster is running the latest Inktank Ceph Enterprise version 1.1 (equivalent to Ceph 0.67.9). It is composed of 48 disk servers (hardware described below [1]), resulting in a total raw capacity of roughly 3 PB. Seven of the 48 servers are used for tests/preprod work, and one further server was dead on arrival and has not yet been repaired, leaving 40 servers in production. Servers are grouped logically by racks, with data being replicated across 4 racks; this policy allows for service continuity in the case of simultaneous disk or host failures in any 3 racks. Note that we have calculated that 3x replication would provide adequate data durability [2], so that change will be implemented in the near future.

Ceph guarantees data consistency using write-ahead journalling [3]. The current configuration co-locates the journal on the same drive as the XFS filestore. The best-practices documentation recommends using SSDs for the write journals, but until now we have not explored this option in production.

[1] 2x Xeon E5-2650, 64 GB RAM, 1x LSI SAS2008 HBA, 24x Hitachi HUA5C303 3 TB drives, 3x Hitachi HUA72302 2 TB system drives.
[2] The Ceph reliability calculator computes 9 nines for 3x RADOS replication and 12 nines for 4x RADOS replication.
[3] In short, Ceph acknowledges writes only after they have been written synchronously to a journal (i.e. a direct-IO file or block device), followed by a buffered write to the filestore (i.e. XFS), which is flushed periodically.
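To make the acknowledgement rule of note [3] concrete, the following is a minimal, illustrative Python sketch of the co-located-journal write path. This is not Ceph code; the paths and object name are hypothetical. The point is that the write becomes durable, and can be acknowledged, only after a synchronous append to the journal, while the filestore copy is merely buffered and flushed later.

    import os
    import tempfile

    base = tempfile.mkdtemp()                      # stand-in for one OSD's data disk
    journal_path = os.path.join(base, "journal")
    filestore_dir = os.path.join(base, "filestore")
    os.makedirs(filestore_dir, exist_ok=True)

    def osd_write(object_name: str, data: bytes) -> None:
        # 1. Synchronous append to the write-ahead journal (O_DSYNC, Unix only).
        #    This is the durability point: only now may the client be acknowledged.
        jfd = os.open(journal_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT | os.O_DSYNC)
        try:
            os.write(jfd, data)
        finally:
            os.close(jfd)

        # 2. Buffered write to the filestore (XFS in production); no fsync here,
        #    the filestore is flushed periodically.
        with open(os.path.join(filestore_dir, object_name), "wb") as f:
            f.write(data)

    # One 4 kB write, the request size that dominates the production workload.
    osd_write("rbd_object_0001", b"\x00" * 4096)

With the journal and filestore on the same spinning disk, every client write costs at least two disk writes on each replica, which is the origin of the halving applied in the capacity estimate in the next section.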

The original plan for our block storage service was to evaluate the usage of our standard EOS disk server hardware. We now have enough experience with this hardware type to understand its limitations with this new use case, and can recommend a way forward to unlock increased performance.

On paper, the current hardware has known IOPS limitations. Assuming 130 IOPS per OSD drive [4], with 40 servers the cluster supports 130 IOPS * 40 * 24 = 124 800 IOPS. With 4x replication this drops to 31 200 writes/sec, and with each disk being used for both the journal and the filestore, the final capacity is at worst roughly 15 000 writes/sec. Assuming a 50% read/write ratio in the cluster, and knowing that reads are much less IO-intensive on the OSDs than writes [5], the total IOPS capacity of the cluster is estimated at between 20 000 and 30 000. (With 3x replication, roughly 41 600 writes/sec would be available to handle the journal and filestore writes; the other numbers would scale up accordingly.)

OpenStack qemu-kvm hypervisors connect to the service via librbd, a user-mode plugin for the RADOS Block Device (RBD) component of Ceph. librbd can be configured with a disk-like write-back cache; currently the hypervisors use a 32 MB write-back cache. qemu-kvm also supports throttling of block device IOs: the configuration up until 20.05.2014 allowed 200 write IOPS, 400 read IOPS, 40 MB/s written, and 80 MB/s read, all per attached block device. From 20.05.2014, volumes are throttled to 100 IOPS read and write and 80 MB/s read and write.

Status

Ceph has been used for the Glance image service since fall 2013, and the Cinder volume service was gradually offered to beta testers starting in late 2013, then to our IT colleagues in February 2014, finally becoming generally available in March. The cluster is instrumented and a variety of metrics are displayed in SLS [6]. The metrics show linear growth towards the current status of more than 400 Cinder volumes with 140 TB of provisioned space, which is stored in Ceph as 55 TB of written data, occupying 220 TB on disk. In addition, we monitor all reads and writes to the cluster and can summarize the access patterns: out of roughly 320 million writes per day, around 75% are 4096 bytes in length, hence the cluster is confirmed to be dominated by small writes.

[4] Measured by fio with 4 kB random writes, ioengine=libaio, and direct=1.
[5] Reads are sent to only one of the replicas, and are often already cached.
[6] http://sls.cern.ch/sls/history.php?id=ceph&more=all&period=24h
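The cluster IOPS estimate above, restated as a short calculation using only the figures quoted in this memo (the 130 IOPS per drive is the fio measurement of note [4]):

    iops_per_drive = 130
    servers_in_production = 40
    drives_per_server = 24

    raw_write_iops = iops_per_drive * servers_in_production * drives_per_server
    print(raw_write_iops)                    # 124800

    after_4x_replication = raw_write_iops // 4
    print(after_4x_replication)              # 31200 writes/sec

    # The co-located journal roughly doubles the writes seen by each drive,
    # so the usable write capacity is at worst about half of that:
    print(after_4x_replication // 2)         # 15600, i.e. "roughly 15 000"

    # For comparison, 3x replication before the journal penalty:
    print(raw_write_iops // 3)               # 41600 writes/sec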

The Problem: Limited IOPS Capacity and Increasing Latencies

One notable SLS metric shows the increasing operations (read + write) per second as reported by the Ceph monitors. During the past two months, increased usage of the service has resulted in peaks of up to 7000 to 8000 IOPS reported by Ceph. These increased IOPS lead to increased latency (mean 64 kB write latency, in ms): when the cluster was not heavily used, the mean latency was 50 ms. A configuration change in week 18 brought this down to 40 ms, but heavy usage of the cluster since week 19 has resulted in latencies above 60 ms, with peaks of up to 100 ms.

Looking forward, we expect this usage to continue to grow in the coming weeks and months; the IOPS capacity of the cluster, and in particular its small-write capacity, will soon limit that growth, and performance may start to suffer. Since we are already observing a high rate of small writes with an occupied space of ~200 TB, it is clear that the IOPS capacity will be exceeded well before the full 3 PB of storage can be significantly occupied.

Solution: SSD Journals

The best practices recommended in the Inktank Hardware Selection Guide [7] suggest the use of SSDs for the write journal at a ratio of up to 5 OSDs per SSD. The Intel DC S3700 or equivalent is recommended for its performance and long-term durability.

[7] http://www.inktank.com/resources/inktank-hardware-selection-guide/

Since the write journal uses synchronous writes, a fast journalling device can help with the IOPS performance. (The write journal is analogous to an NVRAM write cache in a NetApp filer, or to the ZIL in a ZFS system, to take two examples.) When using an SSD journal, we should expect at least a 100% increase in IOPS performance (due to halving the spinning-disk writes), though larger short-term burst increases may be possible.

We have studied the expected IOPS per server with sample units provided by IT-CF. With IT-CF we have built one server with 4x SSDs and 20x 3 TB drives in order to compare its per-server performance against the current disk-only configuration. Using fio with its new RBD driver [8], we have measured the 4 kB write IOPS capacity and the related latencies for various IO depths to the sample OSD servers. (For example, with iodepth=1, fio will keep a single IO in the libaio queue; this will not heavily load the OSD server. With iodepth=128, fio will keep the queue full with 128 outstanding writes, which creates more parallelism on the OSDs and better evaluates the IOPS capacity and latencies.) In all cases we use 3x replication across OSDs, but within the same physical server, and client-side write-back caching is disabled. Below we highlight the main results of these tests.

Above, we have shown the completion latencies for varied IO depths to OSD servers having SSD or spinning-disk journals. With a disk-only OSD server, fio achieves 476 IOPS with a mean latency of 35 ms when using iodepth=16. With SSD-based journals, that same latency is achieved while writing 3660 IOPS. Writing more than ~500 IOPS is not practical for the disk-only server: when writing at 734 IOPS, the mean latency was 173 ms.

[8] http://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html
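A sketch of how such an IO-depth sweep could be scripted against fio's rbd engine (the approach referenced in note [8]). The pool, image and client names below are placeholders, and the exact latency fields in the JSON output vary between fio versions:

    import json
    import subprocess

    def fio_rbd_randwrite(iodepth: int, runtime_s: int = 120) -> dict:
        cmd = [
            "fio", "--name=rbd-4k-randwrite",
            "--ioengine=rbd", "--clientname=admin",
            "--pool=volumes-test", "--rbdname=fio-test",   # placeholder pool/image
            "--rw=randwrite", "--bs=4k", "--direct=1", "--numjobs=1",
            f"--iodepth={iodepth}",
            "--time_based", f"--runtime={runtime_s}",
            "--output-format=json",
        ]
        result = json.loads(subprocess.check_output(cmd))
        return result["jobs"][0]["write"]      # write-side iops and latency stats

    for depth in (1, 4, 16, 64, 128):
        write_stats = fio_rbd_randwrite(depth)
        print(f"iodepth={depth:3d}  write iops={write_stats['iops']:.0f}")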

We now look more closely at the configurations having a mean latency of less than 50 ms. Here we see that the SSD configurations have narrow latency distributions, while the disk configuration's distribution is wide. The next plot highlights this point.

In the CDF plot below we have highlighted the 95th percentile of write completions. The SSD configurations complete 95% of the writes in under 23 ms, 33 ms, and 48 ms, respectively. The disk,16 configuration, despite having a mean latency similar to the SSD configurations (though at lower IOPS), requires 87 ms to complete 95% of the writes.
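To illustrate why the 95th percentile is reported alongside the mean, the following snippet compares two synthetic latency samples (in milliseconds; these are illustrative numbers, not the measured data) that have similar means but very different spreads:

    import random
    import statistics

    random.seed(0)
    narrow = [random.gauss(35, 5) for _ in range(10_000)]       # SSD-like: tight around 35 ms
    wide = [random.expovariate(1 / 35) for _ in range(10_000)]  # disk-like: long tail, same mean

    def p95(samples):
        ordered = sorted(samples)
        return ordered[int(0.95 * len(ordered))]

    for name, samples in (("narrow", narrow), ("wide", wide)):
        print(f"{name}: mean = {statistics.mean(samples):5.1f} ms, p95 = {p95(samples):6.1f} ms")

Both samples have a mean near 35 ms, but the wide distribution takes far longer to complete 95% of its writes, which is the same effect seen for the disk,16 configuration in the CDF above.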

Finally, we plot a longer test of the ssd,64 configuration to check for slowdowns due to OSD flushes or fsync barriers. We can confirm that (at least in this configuration) 2400 writes/sec can be sustained for more than 200 s, which exceeds the Linux kernel and Ceph flush intervals.

Concluding Discussion and Costs

In the tests above we observe that the SSD-journal servers can deliver 2000 to 3000 writes/sec without a large increase in latency. Scaled out to the entire cluster (40 servers in production), this would allow bursts of up to 80 000 to 120 000 writes/sec across the cluster, a factor of 4 to 6 above the current configuration. Note that one limitation of these tests is that long-term (many-hour) tests were not performed; effects which only become apparent at those time scales (e.g. interference with background scrubbing, SMART, or other disk-intensive activities) may decrease the overall IOPS capacity.

The Intel DC S3700 200GB has a retail price of 450 CHF [9]. Four SSDs per server would increase the cost by 1800 CHF (minus the cost of 4 spinning disks) while decreasing the raw capacity by 12 TB (16.7%). The 100GB Intel DC S3700 (at half the cost) would have adequate volume for this purpose; however, these have a write limitation of 200 MB/s, which would prove to be a bottleneck in the throughput to the servers.

[9] http://www.toppreise.ch/prod_301951.html
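The same trade-off, restated as a short calculation with the figures quoted above:

    ssd_price_chf = 450            # Intel DC S3700 200GB, retail [9]
    ssds_per_server = 4
    drive_size_tb = 3
    drives_per_server = 24
    servers_in_production = 40

    added_cost_per_server = ssd_price_chf * ssds_per_server         # 1800 CHF
    lost_capacity_tb = ssds_per_server * drive_size_tb              # 12 TB
    lost_fraction = lost_capacity_tb / (drives_per_server * drive_size_tb)
    print(added_cost_per_server, lost_capacity_tb, f"{lost_fraction:.1%}")   # 1800 12 16.7%

    # Cluster-wide burst capacity, if each SSD-journal server sustains 2000-3000 writes/sec:
    print(2000 * servers_in_production, 3000 * servers_in_production)        # 80000 120000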

In the future, we may also consider providing pools for high-IOPS use cases, using SSD-only configurations [10]. To test this scenario, I propose to outfit all 48 Ceph servers with the four SSDs; the eight preprod servers could then be temporarily configured without spinning-disk OSDs in order to benchmark a provisioned-IOPS-like service. The total cost of this operation is therefore estimated at 450 CHF * 48 * 4 = 86 400 CHF, minus taxes and volume discounts.

One unknown in the future expansion of the block storage cluster is how it would perform with single pools having mixed resources: some SSD/disk servers and some disk-only servers. Without running tests, we can predict that IOs to objects with at least one (out of three or four) replicas on a disk-only server would be penalized, with the synchronous write to the disk-only replica becoming the bottleneck. Thus, in a cluster with 90% SSD/disk and 10% disk-only servers, we estimate that up to 30% of IOs would be affected by the disk-only IOPS limitation (affecting 100% of Cinder volumes).

[10] This would be similar to the Amazon EBS Provisioned IOPS service, where users can pay extra for 1000/2000/3000 IOPS volumes.
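The mixed-pool estimate above can be made explicit. Assuming, as a simplification, that replicas are placed uniformly at random across servers, the probability that a 3x-replicated object has at least one replica on a disk-only server is:

    # Probability that at least one of three replicas lands on a disk-only server,
    # assuming uniform random placement (a simplifying assumption).
    disk_only_fraction = 0.10
    p_affected_3x = 1 - (1 - disk_only_fraction) ** 3
    print(f"{p_affected_3x:.1%}")          # 27.1%, i.e. "up to 30%" of IOs

    # Total cost of outfitting all 48 servers with four SSDs each:
    print(450 * 48 * 4)                    # 86400 CHF, before taxes and discounts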