Ceph at DTU Risø
Frank Schilder


Ceph at DTU Risø

Ceph at DTU Risø Design goals
1) High failure tolerance (long-term)
   - single-disk BlueStore OSDs, no journal aggregation
   - high replication value for 24/7 HA pools (2x2)
   - medium to high parity value for erasure-coded pools (6+2, 8+2, 8+3)
   - duplication of essential hardware
2) Low storage cost
   - use erasure-coded pools as much as possible (see the sketch below)
   - buy complete blobs
   - design the cluster to handle imbalance
3) Performance
   - use SSDs for metadata pools
   - choose EC coding profiles carefully
   - small all-SSD pools for high I/O requirements
   - utilize striping, if supported
   - grow the cluster
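
To make the cost side of these choices concrete, here is a minimal illustrative sketch comparing the usable fraction of raw capacity and the number of failure domains each of the above layouts can lose; reading "2x2" as 4-fold replication across two locations is an assumption.

    # Usable fraction of raw capacity and tolerated failure-domain losses
    # for the pool layouts named above (replication written as k=1, m=n-1).
    layouts = {
        "replication 2x2 (size=4)": (1, 3),   # assumption: 2 copies in each of 2 locations
        "replication size=3":       (1, 2),
        "EC 6+2":                   (6, 2),
        "EC 8+2":                   (8, 2),
        "EC 8+3":                   (8, 3),
    }

    for name, (k, m) in layouts.items():
        usable = k / (k + m)   # fraction of raw capacity that holds data
        print(f"{name:26s} usable = {usable:5.1%}, tolerates {m} lost failure domains")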

Ceph at DTU Risø Mini-tender for core ceph hardware:
- 12 OSD servers with 12x10TB HDD + 4x400GB SSD each
- 3 MON/MGR
- 5 years warranty
- No separate management node yet
- No separate MDS yet (co-located with OSD + extra RAM)
- No separate client nodes yet
- No storage network hardware yet
Total raw storage: 1440TB HDD + 19.2TB SSD, fair fault tolerance

Ceph at DTU Risø Outlook mid-term:
- 17 OSD servers (6 months)
- 3 MON/MGR
- 1 management server
- 2 separate MDS (1 year)
- Growing number of client nodes (DTU-ONE, HPC)
- Some dedicated storage network hardware
- 5 x 80-disk JBODs (1-2 years)
Approximately 6PB raw storage, good fault tolerance

Deployment
- Cluster deployment with OHPC (OpenHPC)
- Ceph container, community edition
- Configuration management with Ansible
https://xkcd.com/1988/

Deployment
Goal: a Ceph cluster that is completely self-contained, with all configuration data redundantly distributed (Vault, etcd).
- MON nodes run the essential distributed services: MON, MGR, NTPD, ETCD.
- Container and ceph status encode the current state of the cluster.
- Configuration data encodes the target state of the cluster.
- A CI procedure implements safe transitions from the current state to the target state.
- Risky transitions require additional approval, for example editing a second file or executing a command manually.
- Computing the difference between current state and target state is a great tool for cluster administration, similar to Red Hat's grading scripts used in courses and exams.
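
As an illustration of the state-difference idea (not the actual DTU tooling), the following sketch models current and target state as plain dictionaries, as they might be stored in etcd or Vault, and lists what a CI run would have to change; the keys are made up.

    # Current and target state as plain dicts; the keys are illustrative only.
    def state_diff(current: dict, target: dict) -> dict:
        """Return what has to change to move from current to target state."""
        return {
            "add":    {k: target[k] for k in target.keys() - current.keys()},
            "remove": {k: current[k] for k in current.keys() - target.keys()},
            "change": {k: (current[k], target[k])
                       for k in current.keys() & target.keys()
                       if current[k] != target[k]},
        }

    current = {"ceph-04/osd.12": "up", "ceph-05/osd.20": "up"}
    target  = {"ceph-04/osd.12": "up", "ceph-05/osd.20": "out", "ceph-08/mds.0": "up"}

    for kind, items in state_diff(current, target).items():
        for key, value in items.items():
            print(f"{kind:6s} {key}: {value}")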

Deployment - ceph-container-*
Requirements: ceph.conf (optional), ceph-container-hosts.conf
1) Deploy and shut down the first MON; this creates the ceph.conf + keyring files.
2) Create ceph-container-disks.conf.
3) Edit all config files as necessary (manually or with Ansible).
4) Populate Vault with the config and keyring files (sketch below).
5) Restart the MON and confirm that the configs are applied.
6) Deploy the cluster (currently requires manual approval for MONs).
7) Have fun!
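
A hedged sketch of step 4: pushing the config and keyring files into Vault with the Vault CLI. The mount point secret/ceph, the file list and the key names are assumptions, not the actual layout; the @file syntax tells the CLI to read the value from the named file.

    import subprocess

    # Assumed secret layout and file paths -- adjust to the real setup.
    FILES = {
        "conf":         "/etc/ceph/ceph.conf",
        "client.admin": "/etc/ceph/ceph.client.admin.keyring",
        "mon.keyring":  "/etc/ceph/ceph.mon.keyring",   # hypothetical path
    }

    for name, path in FILES.items():
        # 'content=@<file>' makes the vault CLI read the value from the file
        subprocess.run(["vault", "kv", "put", f"secret/ceph/{name}",
                        f"content=@{path}"], check=True)
        print(f"stored {path} as secret/ceph/{name}")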

Deployment - hosts.conf
# SR 113      Server room
# SR 113 TL   Tape library
# CON 161A    Container
# HOSTING is a space separated list of ceph daemon
# types / cluster services running on a host.
# HOST     LOCATION     HOSTING
ceph-01    SR 113       MON MGR
ceph-02    SR 113 TL    MON MGR
ceph-03    CON 161A     MON MGR
ceph-04    SR 113       OSD
ceph-05    SR 113       OSD
ceph-06    SR 113       OSD
ceph-07    SR 113       OSD
ceph-08    CON 161A     OSD MDS HEAD
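
The hosts.conf format lends itself to generating an Ansible inventory. The sketch below is a hedged guess at a parser: it assumes '#' starts a comment, the first token on a line is the host, the trailing upper-case tokens are services, and everything in between is the location.

    from collections import defaultdict

    SERVICES = {"MON", "MGR", "OSD", "MDS", "HEAD"}

    def parse_hosts_conf(path="ceph-container-hosts.conf"):
        groups = defaultdict(list)                 # service -> hosts
        with open(path) as fh:
            for raw in fh:
                tokens = raw.split("#", 1)[0].split()
                if not tokens:
                    continue
                host, rest = tokens[0], tokens[1:]
                for service in (t for t in rest if t in SERVICES):
                    groups[service].append(host)
        return groups

    if __name__ == "__main__":
        for service, hosts in parse_hosts_conf().items():
            print(f"[{service.lower()}]")          # e.g. [mon], [osd], [mds]
            print("\n".join(hosts) + "\n")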

Deployment - disks.conf
# HOST     DEV        SIZE    USE   TYPE  WWN
[...]
ceph-03    /dev/sda   111.3G  boot  SSD   wwn-0x6588...
ceph-03    /dev/sdb   558.4G  data  HDD   wwn-0x6588...
[...]
ceph-04    /dev/sda   372.6G  OSD   SSD   wwn-0x58ce...
ceph-04    /dev/sdb   372.6G  OSD   SSD   wwn-0x58ce...
[...]
ceph-04    /dev/sdj   8.9T    OSD   HDD   wwn-0x5000...
ceph-04    /dev/sdk   8.9T    OSD   HDD   wwn-0x5000...
ceph-04    /dev/sdl   8.9T    OSD   HDD   wwn-0x5000...
ceph-04    /dev/sdm   8.9T    OSD   HDD   wwn-0x5000...
ceph-04    /dev/sdn   8.9T    OSD   HDD   wwn-0x5000...
ceph-04    /dev/sdo   8.9T    OSD   HDD   wwn-0x5000...
ceph-04    /dev/sdp   8.9T    OSD   HDD   wwn-0x5000...
ceph-04    /dev/sdq   111.3G  boot  SSD   wwn-0x6588...
[...]
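
A small consistency check that such a disks.conf makes possible (illustrative; the column order HOST DEV SIZE USE TYPE WWN and the wwn-0x... naming are assumptions): compare the WWNs recorded for the local host against what lsblk reports.

    import socket
    import subprocess

    def local_wwns():
        out = subprocess.run(["lsblk", "-dn", "-o", "NAME,WWN"],
                             capture_output=True, text=True, check=True).stdout
        devs = {}
        for line in out.splitlines():
            fields = line.split()
            if len(fields) == 2:                   # skip devices without a WWN
                devs[f"/dev/{fields[0]}"] = fields[1]
        return devs

    def expected_wwns(path="ceph-container-disks.conf"):
        host, expected = socket.gethostname(), {}
        with open(path) as fh:
            for line in fh:
                # columns assumed: HOST DEV SIZE USE TYPE WWN
                fields = line.split("#", 1)[0].split()
                if len(fields) >= 6 and fields[0] == host:
                    expected[fields[1]] = fields[5].replace("wwn-", "")
        return expected

    found = local_wwns()
    for dev, wwn in expected_wwns().items():
        status = "OK" if found.get(dev) == wwn else "MISMATCH"
        print(f"{dev}: expected WWN {wwn} -> {status}")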

Notation
- Monitor/Manager host (Monitor)
- OSD host (OSD)
- MDS host (MDS)
- OSD and MDS co-located
- Ceph client (client)

Distribution of Servers
(diagram: placement of servers in the server room and the container)

Failure domains (fair fault tolerance)
- Each OSD server is split up into 2 failure domains.
- At most 2 disks per OSD server are part of a placement group.
- The SR has 8 and the container has 16 failure domains.
- Pools we plan to use: 3(2) and 4(2) replicated; 6+2, 8+2, 8+3 EC.

Failure domains (fair fault tolerance)
[...]
[osd.0]
crush location = "datacenter=risoe room=sr-113 host=c-04-A"
[osd.4]
crush location = "datacenter=risoe room=sr-113 host=c-04-A"
[...]

Failure domains (fair fault tolerance)
Loss of 1 server (2 failure domains) implies:
- A replicated 3(2) pool might fail (low probability).
- A replicated 4(2) pool is OK.
- An EC 6+2 pool in the SR is just about OK (set min_size=6 and hope for the best?).
- An EC 8+2 or 8+3 pool in the container is OK.
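
The sketch below spells out this reasoning for the worst case, i.e. a PG that really had 2 shards or replicas on the dead server; the min_size values are illustrative (k for the EC pools here, while upstream usually recommends k+1, see the min_size discussion later).

    # Worst case after losing one OSD server: the PG had 2 shards/replicas there.
    pools = {
        # name: (k, m, min_size); replicated pools written as k=1, m=size-1
        "replicated 3(2)": (1, 2, 2),
        "replicated 4(2)": (1, 3, 2),
        "EC 6+2":          (6, 2, 6),   # min_size=k here; upstream suggests k+1
        "EC 8+2":          (8, 2, 8),
        "EC 8+3":          (8, 3, 8),
    }
    LOST = 2    # a dead server takes at most 2 shards of any PG with it

    for name, (k, m, min_size) in pools.items():
        total = k + m
        left = total - LOST
        if left < k:
            verdict = "data unavailable until recovered from elsewhere"
        elif left < min_size:
            verdict = f"PG inactive (below min_size={min_size}) until recovery"
        elif left == k:
            verdict = "still active, but no redundancy left"
        else:
            verdict = "OK"
        print(f"{name:15s} {left}/{total} shards left -> {verdict}")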

Failure domains (fair fault tolerance)
Temporary workarounds (or take the risk for a while):
- Replicated 3(2) pool: check for critical PGs and upmap (sketch below); define 1 failure domain per host for SSD pools.
- Replicated 4(2) pool is OK.
- EC 6+2 pool in the SR: allocate 2 PGs in the container.
- EC 8+2 or 8+3 pool in the container is OK.
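
For the first workaround, a minimal sketch of what "check for critical PGs" could look like: the osd-to-server and pg-to-OSD mappings are made up (in practice they would come from ceph osd tree and ceph pg dump), and the suggested ceph osd pg-upmap-items call is deliberately left incomplete.

    from collections import Counter

    # Illustrative mappings; in practice derived from 'ceph osd tree' and 'ceph pg dump'.
    osd_to_server = {0: "ceph-04", 1: "ceph-04", 2: "ceph-05", 3: "ceph-05",
                     4: "ceph-06", 5: "ceph-06", 6: "ceph-07", 7: "ceph-07"}
    pg_to_osds = {"2.1a": [0, 2, 4],    # spread over three servers: fine
                  "2.3f": [0, 1, 6]}    # two replicas on ceph-04: critical

    MIN_SIZE = 2                        # replicated 3(2) pool

    for pg, osds in pg_to_osds.items():
        server, worst = Counter(osd_to_server[o] for o in osds).most_common(1)[0]
        if len(osds) - worst < MIN_SIZE:
            print(f"PG {pg} is critical: losing {server} leaves "
                  f"{len(osds) - worst} < min_size={MIN_SIZE} replicas; "
                  f"consider 'ceph osd pg-upmap-items {pg} ...'")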

Benchmark results
It is not easy to find actual test results for performance as a function of the EC profile. The only best-practice-like recommendations I could find were "use 4+2" and "8+3 is good", with no reasons given. Our original plan was to use 5+2 and 10+4 EC profiles for low replication overhead with high redundancy.
Questions:
- What is the theoretical limit (rough estimate below)? How close do we get?
- Which EC profiles perform best? Is there a difference?
- Other parameters?
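
For the first question, a back-of-the-envelope estimate (all inputs are assumptions, in particular the per-HDD write bandwidth; BlueStore WAL, metadata and network overheads are ignored): an EC k+m pool writes (k+m)/k bytes to disk per client byte, so the HDDs cap aggregated client writes at roughly n_disks * per_disk_bandwidth * k/(k+m).

    N_HDD = 96           # assumption: 8 OSD hosts x 12 HDDs
    PER_HDD_MB_S = 100   # assumed sustainable write bandwidth per HDD

    for k, m in [(5, 2), (6, 2), (8, 2), (8, 3), (10, 4), (12, 4)]:
        limit = N_HDD * PER_HDD_MB_S * k / (k + m)
        print(f"EC {k}+{m}: ~{limit:.0f} MB/s aggregated client writes (upper bound)")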

Benchmark results Random write test 4K write size, IOP/s total (aggregated), higher is better RBD Obj size # Nodes +threads writing Some client and storage nodes on same switch (first two columns), else different switches. Pool (location disk type EC coding/rep profile) CON HDD/HDD CON HDD 5+2 CON HDD 5+2 CON HDD 10+4 SR HDD 5+2 5+2 SMALL SMALL NO BOND 320K 320K 320K 259.60 428.18 463.11 398.72 453.29 433.31 861.74 1069.20 1100.48 436.24 536.68 502.84 360.58 455.62 450.98 1280K 1280K 1280K 343.91 438.53 466.93 422.29 400.74 512.11 865.39 1008.84 937.68 427.49 497.62 518.08 454.62 448.89 505.47 5M 5M 5M 424.76 456.21 454.36 476.58 477.06 485.15 1166.09 873.38 802.30 496.41 455.71 421.80 537.41 488.21 480.38

Benchmark results Random write test 4K write size, IOP/s total (aggregated), higher is better RBD Obj size # Nodes +threads writing Client and storage nodes on different switches. Pool (location disk type EC coding/rep profile) SR HDD 6+2 CON HDD 6+2 CON HDD 12+4 384K 384K 384K 622.16 689.44 600.36 1418.71 1544.20 1281.18 524.62 606.72 639.11 384K 384K 384K 1 4 2 4 4 4 661.19 702.17 657.70 1542.01 1377.84 1396.64 600.43 654.65 644.33 1536K 1536K 1536K 974.09 724.26 724.20 2054.86 2006.42 1400.38 942.55 740.27 635.85 1536K 1536K 1536K 1 4 2 4 4 4 875.82 777.69 764.84 1620.05 1584.98 1495.50 776.16 742.37 694.70 6M 6M 6M 961.19 624.07 687.98 1867.18 1847.60 1323.47 914.07 773.77 669.21 6M 6M 6M 1 4 2 4 4 4 826.29 769.47 806.14 1620.28 1713.53 1567.61 814.23 809.97 774.56

Benchmark results Random write test 4K write size, IOP/s total (aggregated), higher is better RBD Obj size # Nodes +threads writing SR HDD 6+2 Client and storage nodes on different switches. Pool (location disk type EC coding/rep profile) CON HDD 6+2 CON HDD 8+2 CON SSD 8+2 CON HDD x3 CON SSD x3 512K 512K 512K 696.25 684.05 729.52 1342.67 1260.89 1401.56 1266.57 1293.89 1198.43 512K 512K 512K 1 4 2 4 4 4 646.21 750.72 714.98 1134.43 1275.48 1218.68 1178.40 1207.73 1183.10 2048K 2048K 2048K 605.62 828.63 775.28 1726.62 1600.24 1129.25 1572.46 1620.74 1273.16 2048K 2048K 2048K 1 4 2 4 4 4 760.95 825.28 741.86 1440.60 1518.04 1301.58 1295.73 1372.15 1232.11 974.26 734.59 700.32 1632.52 1685.02 1188.26 1505.21 1697.48 1210.27 11821.96 19666.92 20703.94 13080.43 13282.64 9178.39 19802.10 41696.83 74143.42 1 4 2 4 4 4 873.33 841.76 767.25 1538.51 1591.45 1384.43 1520.18 1470.63 1343.34 19933.91 20311.90 19997.78 8511.53 7824.52 7716.51 76655.96 134841.27 134149.55

Benchmark results Sequential write test, MB/s total (aggregated), higher is better RBD Obj size # Nodes +threads writing 5M write size. Some client and storage nodes on same switch (first two columns), else different switches. Pool (location disk type EC coding/rep profile) CON HDD/HDD CON HDD 5+2 CON HDD 5+2 CON HDD 10+4 SR HDD 5+2 5+2 SMALL SMALL NO BOND 320K 320K 320K 153.55 253.60 271.61 174.44 246.21 255.57 484.52 562.61 489.01 170.10 202.53 253.58 222.62 257.40 261.18 1280K 1280K 1280K 192.29 269.60 363.37 201.69 285.97 364.56 482.95 673.74 676.85 591.73 781.03 736.10 372.57 372.25 365.82 5M 5M 5M 259.54 453.82 770.68 289.41 555.61 545.74 617.08 928.59 1156.05 413.43 540.03 681.22 504.50 572.95 603.86

Benchmark results Sequential write test, MB/s total (aggregated), higher is better RBD Obj size # Nodes +threads writing 6M write size. Client and storage nodes on different switches. Pool (location disk type EC coding/rep profile) SR HDD 6+2 CON HDD 6+2 CON HDD 12+4 384K 384K 384K 272.97 319.87 295.44 440.64 629.66 605.62 564.08 546.44 507.71 384K 384K 384K 1 4 2 4 4 4 395.12 431.99 466.08 601.42 762.91 878.84 654.88 792.31 791.70 1536K 1536K 1536K 491.43 705.17 751.44 579.08 1027.69 1462.08 716.69 1052.04 1064.50 1536K 1536K 1536K 1 4 2 4 4 4 797.95 843.91 870.31 1096.07 1653.20 1537.30 1085.04 1274.31 1332.01 6M 6M 6M 518.28 780.75 851.95 617.57 1078.99 1641.57 537.25 883.57 977.42 6M 6M 6M 1 4 2 4 4 4 982.06 1086.49 1121.48 1099.89 2007.42 2047.80 961.50 1399.82 1415.06

Benchmark results Sequential write test, MB/s total (aggregated), higher is better RBD Obj size # Nodes +threads writing write size. Client and storage nodes on different switches. Pool (location disk type EC coding/rep profile) SR HDD 6+2 CON HDD 6+2 CON HDD 8+2 CON SSD 8+2 CON HDD x3 CON SSD x3 512K 512K 512K 269.91 289.82 296.99 538.59 628.09 572.89 834.03 1102.23 988.78 512K 512K 512K 1 4 2 4 4 4 394.93 448.76 476.23 676.98 779.46 828.17 1107.66 1424.50 1554.29 2048K 2048K 2048K 542.33 682.56 677.14 688.06 1058.38 1353.01 880.93 1456.48 1853.81 2048K 2048K 2048K 1 4 2 4 4 4 882.91 851.52 857.32 1110.65 1632.33 1596.98 1179.36 2200.24 3032.32 469.44 779.78 994.23 725.03 1248.54 1873.67 855.70 1558.90 2510.78 1016.07 1992.77 3471.05 836.59 1321.90 1617.95 1098.89 2033.48 2938.83 1 4 2 4 4 4 976.41 1600.65 1726.76 1130.60 2146.36 3139.14 1170.23 2321.77 4069.27 1182.62 2372.27 4657.92 1167.97 2006.99 1937.81 1194.96 2382.88 3538.56

Benchmark results - winners
Random write test, 4K write size, IOP/s total (aggregated), higher is better. RBD obj size / # nodes + threads writing. Client and storage nodes on different switches.
Pool (location disk type EC coding/rep profile): SR HDD 6+2, CON HDD 6+2, CON HDD 8+2, CON SSD 8+2, CON HDD x3, CON SSD x3
974.26 734.59 700.32 1632.52 1685.02 1188.26 1505.21 1697.48 1210.27 11821.96 19666.92 20703.94 13080.43 13282.64 9178.39 19802.10 41696.83 74143.42
1 4 2 4 4 4 873.33 841.76 767.25 1538.51 1591.45 1384.43 1520.18 1470.63 1343.34 19933.91 20311.90 19997.78 8511.53 7824.52 7716.51 76655.96 134841.27 134149.55
SR HDD 6+2  : 6+2 EC pool on 4 OSD hosts with 2 shards per host
CON HDD 6+2 : 6+2 EC pool on 8 OSD hosts with up to 2 shards per host
CON HDD 8+2 : 8+2 EC pool on 8 OSD hosts with up to 2 shards per host
CON SSD 8+2 : 8+2 EC pool on 8 OSD hosts with up to 2 shards per host
CON HDD x3  : size=3, min_size=2 repl. pool on 8 OSD hosts with up to 2 replicas per host
CON SSD x3  : size=3, min_size=2 repl. pool on 8 OSD hosts with up to 2 replicas per host

Benchmark results - winners
Sequential write test, MB/s total (aggregated), higher is better. RBD obj size / # nodes + threads writing. 6M/ write size. Client and storage nodes on different switches.
Pool (location disk type EC coding/rep profile): SR HDD 6+2, CON HDD 6+2, CON HDD 8+2, CON SSD 8+2, CON HDD x3, CON SSD x3
469.44 779.78 994.23 725.03 1248.54 1873.67 855.70 1558.90 2510.78 1016.07 1992.77 3471.05 836.59 1321.90 1617.95 1098.89 2033.48 2938.83
1 4 2 4 4 4 976.41 1600.65 1726.76 1130.60 2146.36 3139.14 1170.23 2321.77 4069.27 1182.62 2372.27 4657.92 1167.97 2006.99 1937.81 1194.96 2382.88 3538.56
SR HDD 6+2  : 6+2 EC pool on 4 OSD hosts with 2 shards per host
CON HDD 6+2 : 6+2 EC pool on 8 OSD hosts with up to 2 shards per host
CON HDD 8+2 : 8+2 EC pool on 8 OSD hosts with up to 2 shards per host
CON SSD 8+2 : 8+2 EC pool on 8 OSD hosts with up to 2 shards per host
CON HDD x3  : size=3, min_size=2 repl. pool on 8 OSD hosts with up to 2 replicas per host
CON SSD x3  : size=3, min_size=2 repl. pool on 8 OSD hosts with up to 2 replicas per host

Benchmark results - recordings


Troubleshooting ceph
My experience so far can be summarized as: if a healthy ceph cluster falls sick, it is either almost certainly not caused by ceph, or it is due to misconfiguration, in which case one might have a problem that requires ceph training to resolve. This matches the response I got from every ceph admin/trainer I have met. It implies that in almost all cases ceph troubleshooting is basically restricted to checking hardware health and can be done by staff without ceph training. Once hardware failures are fixed or ruled out, the cluster usually heals itself. It is rather rare that one needs help from an experienced and/or trained person during ordinary operations.
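
In that spirit, a hedged sketch of a hardware health sweep that needs no ceph knowledge at all: run a SMART health check on every disk lsblk reports. The plain-text matching is an assumption, and some controllers need vendor-specific smartctl options.

    import subprocess

    def block_devices():
        out = subprocess.run(["lsblk", "-dn", "-o", "NAME,TYPE"],
                             capture_output=True, text=True, check=True).stdout
        return [f"/dev/{name}" for name, typ in
                (line.split() for line in out.splitlines()) if typ == "disk"]

    for dev in block_devices():
        result = subprocess.run(["smartctl", "-H", dev],
                                capture_output=True, text=True)
        healthy = "PASSED" in result.stdout or "OK" in result.stdout
        print(f"{dev}: {'healthy' if healthy else 'inspect SMART output'}")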

Troubleshooting ceph - is fun!


Best practices? Problems? Typical recommendations and reality:
- ceph-ansible / ceph-deploy / ceph-container community edition / RHE storage
- the EC profile min_size=k+1 mystery (check sketch below)
- ceph and the laws of small numbers
- EC pools and on-storage compute
- EC pools / replicated pools: when and why
- hardware acquisition strategy
- which ceph version
- DeiC ceph admin group?
- partitioning of disks (containers, large logs)
Do not believe. Test as much as you can.
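
For the min_size=k+1 item, a small hedged sketch that compares each erasure-coded pool's min_size with k+1 from its EC profile, using plain ceph CLI calls; the "key: value" and "key=value" output formats are assumed.

    import subprocess

    def ceph(*args):
        return subprocess.run(["ceph", *args], capture_output=True,
                              text=True, check=True).stdout.strip()

    for pool in ceph("osd", "pool", "ls").splitlines():
        try:
            profile = ceph("osd", "pool", "get", pool,
                           "erasure_code_profile").split(":")[1].strip()
        except subprocess.CalledProcessError:
            continue                                # replicated pool, no EC profile
        params = dict(line.split("=", 1) for line in
                      ceph("osd", "erasure-code-profile", "get", profile).splitlines()
                      if "=" in line)
        k = int(params["k"])
        min_size = int(ceph("osd", "pool", "get", pool, "min_size").split(":")[1])
        verdict = "OK" if min_size >= k + 1 else f"min_size={min_size} < k+1={k + 1}"
        print(f"{pool} (profile {profile}, k={k}): {verdict}")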

Use latest LTS version
http://docs.ceph.com/docs/mimic/install/get-packages/#add-ceph
The current LTS version is Luminous (12.2.8). We are currently on 12.2.7.
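
A small hedged helper for keeping track of this: ceph versions (available since Luminous) reports which release every daemon type runs; mixed versions usually mean an unfinished upgrade. The JSON layout assumed here is the usual daemon-type to version-count mapping.

    import json
    import subprocess

    out = subprocess.run(["ceph", "versions", "--format", "json"],
                         capture_output=True, text=True, check=True).stdout
    versions = json.loads(out)

    for daemon_type, counts in versions.items():
        for version, count in counts.items():
            print(f"{daemon_type:8s} {count:3d} x {version}")
    if len(versions.get("overall", {})) > 1:
        print("WARNING: more than one ceph version running in the cluster")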