Ceph at DTU Risø
Frank Schilder
Design goals

1) High failure tolerance
   - (long-term) single-disk BlueStore OSDs, no journal aggregation
   - high replication value for 24/7 HA pools (2x2)
   - medium to high parity value for erasure-coded pools (6+2, 8+2, 8+3)
   - duplication of essential hardware

2) Low storage cost
   - use erasure-coded pools as much as possible
   - buy complete blobs
   - design the cluster to handle imbalance

3) Performance
   - use SSDs for metadata pools
   - choose EC coding profiles carefully (see the sketch below)
   - small all-SSD pools for high I/O requirements
   - utilize striping, if supported
   - grow the cluster
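As an illustration of the pool types above, a minimal sketch of creating an erasure-coded pool with host-level failure domains. Profile name, pool name and PG count (ec-8-2, ec82pool, 256) are hypothetical, not the values used at Risø:

    # define an 8+2 EC profile and create a pool that uses it
    ceph osd erasure-code-profile set ec-8-2 k=8 m=2 crush-failure-domain=host
    ceph osd pool create ec82pool 256 256 erasure ec-8-2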
Mini-tender for core Ceph hardware:
- 12 OSD servers with 12x10TB HDD + 4x400GB SSD each
- 3 MON/MGR
- 5 years warranty
- no separate management node yet
- no separate MDS yet (co-locate with OSD + extra RAM)
- no separate client nodes yet
- no storage network hardware yet
Total raw storage: 1440TB HDD + 19.2TB SSD; fair fault tolerance.
Outlook, mid-term:
- 17 OSD servers (6 months)
- 3 MON/MGR
- 1 management server
- 2 separate MDS (1 year)
- growing number of client nodes (DTU-ONE, HPC)
- some dedicated storage network hardware
- 5 x 80-disk JBODs (1-2 years)
Approximately 6PB raw storage; good fault tolerance.
Deployment
- cluster deployment with OHPC
- Ceph container community edition
- configuration management with Ansible
https://xkcd.com/1988/
Deployment

Goal: a Ceph cluster that is completely self-contained, with all configuration data redundantly distributed (Vault, etcd).
- MON nodes run the essential distributed services: MON, MGR, NTPD, ETCD.
- Container and Ceph status encode the current state of the cluster.
- Configuration data encodes the target state of the cluster.
- A CI procedure implements safe transitions from the current state to the target state.
- Risky transitions require additional approval, for example editing a second file or executing a command manually.
- Computing the difference between current state and target state is a great tool for cluster administration, similar to Red Hat's grading scripts used in courses and exams.
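A minimal sketch of such a current-vs-target comparison, assuming the target ceph.conf is kept in etcd; the key name /ceph/target/ceph.conf is hypothetical:

    # etcd v3 API; key layout is an assumption for illustration
    ETCDCTL_API=3 etcdctl get /ceph/target/ceph.conf --print-value-only > /tmp/ceph.conf.target
    diff -u /etc/ceph/ceph.conf /tmp/ceph.conf.target \
        && echo "node config matches target state"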
Deployment - ceph-container-*

Requirements: ceph.conf (optional), ceph-container-hosts.conf
1. Deploy and shut down the first MON; this creates ceph.conf + keyring files.
2. Create ceph-container-disks.conf.
3. Edit all config files as necessary (manually or with Ansible).
4. Populate Vault with config and keyring files.
5. Restart the MON and confirm that the configs are applied.
6. Deploy the cluster (currently requires manual approval for MONs).
7. Have fun!
Deployment - hosts.conf

    # HOSTING is a space-separated list of ceph daemon
    # types / cluster services running on a host.
    #
    # Locations: SR 113 = server room, SR 113 TL = tape library, CON 161A = container
    #
    # HOST     LOCATION   HOSTING
    ceph-01    SR 113     MON MGR
    ceph-02    SR 113 TL  MON MGR
    ceph-03    CON 161A   MON MGR
    ceph-04    SR 113     OSD
    ceph-05    SR 113     OSD
    ceph-06    SR 113     OSD
    ceph-07    SR 113     OSD
    ceph-08    CON 161A   OSD MDS HEAD
Deployment - disks.conf

    # HOST     DEV        SIZE    USE   TYPE  WWN
    ceph-03    /dev/sda   111.3G  boot  SSD   wwn-0x6588...
    ceph-03    /dev/sdb   558.4G  data  HDD   wwn-0x6588...
    ceph-04    /dev/sda   372.6G  OSD   SSD   wwn-0x58ce...
    ceph-04    /dev/sdb   372.6G  OSD   SSD   wwn-0x58ce...
    [...]
    ceph-04    /dev/sdj   8.9T    OSD   HDD   wwn-0x5000...
    ceph-04    /dev/sdk   8.9T    OSD   HDD   wwn-0x5000...
    ceph-04    /dev/sdl   8.9T    OSD   HDD   wwn-0x5000...
    ceph-04    /dev/sdm   8.9T    OSD   HDD   wwn-0x5000...
    ceph-04    /dev/sdn   8.9T    OSD   HDD   wwn-0x5000...
    ceph-04    /dev/sdo   8.9T    OSD   HDD   wwn-0x5000...
    ceph-04    /dev/sdp   8.9T    OSD   HDD   wwn-0x5000...
    ceph-04    /dev/sdq   111.3G  boot  SSD   wwn-0x6588...
    [...]
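The WWNs in disks.conf identify drives independently of enumeration order; they can be matched to the current kernel device names via the persistent udev links, for example:

    # map a WWN from disks.conf to its current /dev/sd* name
    ls -l /dev/disk/by-id/ | grep wwn-0x5000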
Notation
- Monitor/Manager host (Monitor)
- OSD host (OSD)
- MDS host (MDS)
- OSD and MDS co-located
- Ceph client (client)
Distribution of servers
[diagram: placement of servers in the server room and the container]
Failure domains (fair fault tolerance)
[diagram: server room and container failure domains]
- Each OSD server is split into 2 failure domains.
- At most 2 disks per OSD server are part of any placement group.
- The server room has 8 and the container has 16 failure domains.
- Pools we plan to use: 3(2) and 4(2) replicated; 6+2, 8+2, 8+3 EC.
Failure domains (fair fault tolerance)

    [...]
    [osd.0]
    crush location = "datacenter=risoe room=sr-113 host=c-04-A"
    [osd.4]
    crush location = "datacenter=risoe room=sr-113 host=c-04-A"
    [...]
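Whether the crush locations took effect can be checked against the live map; these are standard commands, and the output shape depends on the actual CRUSH map:

    ceph osd crush tree   # buckets: datacenter / room / (pseudo-)host
    ceph osd find 0       # reports the crush location of osd.0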
Failure domains (fair fault tolerance)

Loss of 1 server (2 failure domains) implies:
- Replicated 3(2) pool might fail (low probability).
- Replicated 4(2) pool is OK.
- EC 6+2 pool in SR is just about OK (set min_size=6 and hope for the best? see the sketch below).
- EC 8+2 or 8+3 pool in container is OK.
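A hedged sketch of the min_size=6 option for a 6+2 pool; the pool name ec62pool is a placeholder, and since this trades safety for availability it should be reverted once the server is back:

    ceph osd pool set ec62pool min_size 6   # keep serving I/O with only k shards
    # once redundancy is restored:
    ceph osd pool set ec62pool min_size 7   # back to k+1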
Failure domains (fair fault tolerance)

Temporary workarounds (or take the risk for a while):
- Replicated 3(2) pool: check for critical PGs and upmap (see below); define 1 failure domain per host for SSD pools.
- Replicated 4(2) pool is OK.
- EC 6+2 pool in SR: allocate 2 PGs in the container.
- EC 8+2 or 8+3 pool in container is OK.
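The upmap workaround could look roughly as follows; the PG and OSD ids are invented for illustration, and upmap requires luminous-or-newer clients:

    ceph osd set-require-min-compat-client luminous
    # inspect which OSDs each PG landed on
    ceph pg dump pgs
    # move one replica of pg 1.2f from osd.7 to osd.23 (hypothetical ids)
    ceph osd pg-upmap-items 1.2f 7 23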
Benchmark results

It is not easy to find actual test results for performance as a function of the EC profile. The only best-practice-like recommendations I could find were "use 4+2" and "8+3 is good", with no reasons given. Our original plan was to use 5+2 and 10+4 EC profiles for low replication overhead with high redundancy.

Questions:
- What is the theoretical limit? How close do we get?
- Which EC profiles perform best? Is there a difference?
- Other parameters?
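For context, the kind of client-side load generator behind numbers like those below; this is an assumed fio invocation against an RBD image, not the exact command used for these slides:

    # 4K random writes against an RBD image in a (hypothetical) EC pool
    fio --name=randwrite-4k --ioengine=rbd --clientname=admin \
        --pool=ec82pool --rbdname=bench0 \
        --rw=randwrite --bs=4k --iodepth=16 --direct=1 \
        --time_based --runtime=300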
Benchmark results

Random write test: 4K write size, IOP/s total (aggregated), higher is better.
Some client and storage nodes on the same switch (first two columns), else different switches.
Columns: pool (location, disk type, EC coding/rep profile). Rows: RBD object size, three runs each with varying numbers of writing nodes+threads.

    RBD obj  CON HDD/HDD  CON HDD 5+2  CON HDD 5+2    CON HDD  SR HDD
    size     5+2          SMALL        SMALL NO BOND  10+4     5+2
    320K      259.60       398.72       861.74        436.24   360.58
    320K      428.18       453.29      1069.20        536.68   455.62
    320K      463.11       433.31      1100.48        502.84   450.98
    1280K     343.91       422.29       865.39        427.49   454.62
    1280K     438.53       400.74      1008.84        497.62   448.89
    1280K     466.93       512.11       937.68        518.08   505.47
    5M        424.76       476.58      1166.09        496.41   537.41
    5M        456.21       477.06       873.38        455.71   488.21
    5M        454.36       485.15       802.30        421.80   480.38
Benchmark results

Random write test: 4K write size, IOP/s total (aggregated), higher is better.
Client and storage nodes on different switches.
Columns: pool (location, disk type, EC coding profile). Nodes+threads: number of writing nodes + threads per node (e.g. 1+4), where preserved.

    RBD obj  Nodes+   SR HDD   CON HDD   CON HDD
    size     threads  6+2      6+2       12+4
    384K               622.16  1418.71    524.62
    384K               689.44  1544.20    606.72
    384K               600.36  1281.18    639.11
    384K     1+4       661.19  1542.01    600.43
    384K     2+4       702.17  1377.84    654.65
    384K     4+4       657.70  1396.64    644.33
    1536K              974.09  2054.86    942.55
    1536K              724.26  2006.42    740.27
    1536K              724.20  1400.38    635.85
    1536K    1+4       875.82  1620.05    776.16
    1536K    2+4       777.69  1584.98    742.37
    1536K    4+4       764.84  1495.50    694.70
    6M                 961.19  1867.18    914.07
    6M                 624.07  1847.60    773.77
    6M                 687.98  1323.47    669.21
    6M       1+4       826.29  1620.28    814.23
    6M       2+4       769.47  1713.53    809.97
    6M       4+4       806.14  1567.61    774.56
Benchmark results

Random write test: 4K write size, IOP/s total (aggregated), higher is better.
Client and storage nodes on different switches.
Columns: pool (location, disk type, EC coding/rep profile).

    RBD obj  Nodes+   SR HDD   CON HDD   CON HDD   CON SSD    CON HDD    CON SSD
    size     threads  6+2      6+2       8+2       8+2        x3         x3
    512K              696.25   1342.67   1266.57
    512K              684.05   1260.89   1293.89
    512K              729.52   1401.56   1198.43
    512K     1+4      646.21   1134.43   1178.40
    512K     2+4      750.72   1275.48   1207.73
    512K     4+4      714.98   1218.68   1183.10
    2048K             605.62   1726.62   1572.46
    2048K             828.63   1600.24   1620.74
    2048K             775.28   1129.25   1273.16
    2048K    1+4      760.95   1440.60   1295.73
    2048K    2+4      825.28   1518.04   1372.15
    2048K    4+4      741.86   1301.58   1232.11
                      974.26   1632.52   1505.21   11821.96   13080.43    19802.10
                      734.59   1685.02   1697.48   19666.92   13282.64    41696.83
                      700.32   1188.26   1210.27   20703.94    9178.39    74143.42
             1+4      873.33   1538.51   1520.18   19933.91    8511.53    76655.96
             2+4      841.76   1591.45   1470.63   20311.90    7824.52   134841.27
             4+4      767.25   1384.43   1343.34   19997.78    7716.51   134149.55
Benchmark results

Sequential write test: 5M write size, MB/s total (aggregated), higher is better.
Some client and storage nodes on the same switch (first two columns), else different switches.
Columns: pool (location, disk type, EC coding/rep profile). Rows: RBD object size, three runs each with varying numbers of writing nodes+threads.

    RBD obj  CON HDD/HDD  CON HDD 5+2  CON HDD 5+2    CON HDD  SR HDD
    size     5+2          SMALL        SMALL NO BOND  10+4     5+2
    320K      153.55       174.44       484.52        170.10   222.62
    320K      253.60       246.21       562.61        202.53   257.40
    320K      271.61       255.57       489.01        253.58   261.18
    1280K     192.29       201.69       482.95        591.73   372.57
    1280K     269.60       285.97       673.74        781.03   372.25
    1280K     363.37       364.56       676.85        736.10   365.82
    5M        259.54       289.41       617.08        413.43   504.50
    5M        453.82       555.61       928.59        540.03   572.95
    5M        770.68       545.74      1156.05        681.22   603.86
Benchmark results

Sequential write test: 6M write size, MB/s total (aggregated), higher is better.
Client and storage nodes on different switches.
Columns: pool (location, disk type, EC coding profile). Nodes+threads: number of writing nodes + threads per node, where preserved.

    RBD obj  Nodes+   SR HDD   CON HDD   CON HDD
    size     threads  6+2      6+2       12+4
    384K               272.97   440.64    564.08
    384K               319.87   629.66    546.44
    384K               295.44   605.62    507.71
    384K     1+4       395.12   601.42    654.88
    384K     2+4       431.99   762.91    792.31
    384K     4+4       466.08   878.84    791.70
    1536K              491.43   579.08    716.69
    1536K              705.17  1027.69   1052.04
    1536K              751.44  1462.08   1064.50
    1536K    1+4       797.95  1096.07   1085.04
    1536K    2+4       843.91  1653.20   1274.31
    1536K    4+4       870.31  1537.30   1332.01
    6M                 518.28   617.57    537.25
    6M                 780.75  1078.99    883.57
    6M                 851.95  1641.57    977.42
    6M       1+4       982.06  1099.89    961.50
    6M       2+4      1086.49  2007.42   1399.82
    6M       4+4      1121.48  2047.80   1415.06
Benchmark results

Sequential write test: MB/s total (aggregated), higher is better.
Client and storage nodes on different switches.
Columns: pool (location, disk type, EC coding/rep profile).

    RBD obj  Nodes+   SR HDD   CON HDD   CON HDD   CON SSD   CON HDD   CON SSD
    size     threads  6+2      6+2       8+2       8+2       x3        x3
    512K              269.91    538.59    834.03
    512K              289.82    628.09   1102.23
    512K              296.99    572.89    988.78
    512K     1+4      394.93    676.98   1107.66
    512K     2+4      448.76    779.46   1424.50
    512K     4+4      476.23    828.17   1554.29
    2048K             542.33    688.06    880.93
    2048K             682.56   1058.38   1456.48
    2048K             677.14   1353.01   1853.81
    2048K    1+4      882.91   1110.65   1179.36
    2048K    2+4      851.52   1632.33   2200.24
    2048K    4+4      857.32   1596.98   3032.32
                      469.44    725.03    855.70   1016.07    836.59   1098.89
                      779.78   1248.54   1558.90   1992.77   1321.90   2033.48
                      994.23   1873.67   2510.78   3471.05   1617.95   2938.83
             1+4      976.41   1130.60   1170.23   1182.62   1167.97   1194.96
             2+4     1600.65   2146.36   2321.77   2372.27   2006.99   2382.88
             4+4     1726.76   3139.14   4069.27   4657.92   1937.81   3538.56
Benchmark results - winners

Random write test: 4K write size, IOP/s total (aggregated), higher is better.
Client and storage nodes on different switches.

    Nodes+   SR HDD   CON HDD   CON HDD   CON SSD    CON HDD    CON SSD
    threads  6+2      6+2       8+2       8+2        x3         x3
             974.26   1632.52   1505.21   11821.96   13080.43    19802.10
             734.59   1685.02   1697.48   19666.92   13282.64    41696.83
             700.32   1188.26   1210.27   20703.94    9178.39    74143.42
    1+4      873.33   1538.51   1520.18   19933.91    8511.53    76655.96
    2+4      841.76   1591.45   1470.63   20311.90    7824.52   134841.27
    4+4      767.25   1384.43   1343.34   19997.78    7716.51   134149.55

Pools:
    SR HDD 6+2  : 6+2 EC pool on 4 OSD hosts with 2 shards per host
    CON HDD 6+2 : 6+2 EC pool on 8 OSD hosts with up to 2 shards per host
    CON HDD 8+2 : 8+2 EC pool on 8 OSD hosts with up to 2 shards per host
    CON SSD 8+2 : 8+2 EC pool on 8 OSD hosts with up to 2 shards per host
    CON HDD x3  : size=3, min_size=2 repl. pool on 8 OSD hosts with up to 2 replicas per host
    CON SSD x3  : size=3, min_size=2 repl. pool on 8 OSD hosts with up to 2 replicas per host
Benchmark results - winners

Sequential write test: 6M write size, MB/s total (aggregated), higher is better.
Client and storage nodes on different switches.

    Nodes+   SR HDD   CON HDD   CON HDD   CON SSD   CON HDD   CON SSD
    threads  6+2      6+2       8+2       8+2       x3        x3
              469.44    725.03    855.70   1016.07    836.59   1098.89
              779.78   1248.54   1558.90   1992.77   1321.90   2033.48
              994.23   1873.67   2510.78   3471.05   1617.95   2938.83
    1+4       976.41   1130.60   1170.23   1182.62   1167.97   1194.96
    2+4      1600.65   2146.36   2321.77   2372.27   2006.99   2382.88
    4+4      1726.76   3139.14   4069.27   4657.92   1937.81   3538.56

Pools:
    SR HDD 6+2  : 6+2 EC pool on 4 OSD hosts with 2 shards per host
    CON HDD 6+2 : 6+2 EC pool on 8 OSD hosts with up to 2 shards per host
    CON HDD 8+2 : 8+2 EC pool on 8 OSD hosts with up to 2 shards per host
    CON SSD 8+2 : 8+2 EC pool on 8 OSD hosts with up to 2 shards per host
    CON HDD x3  : size=3, min_size=2 repl. pool on 8 OSD hosts with up to 2 replicas per host
    CON SSD x3  : size=3, min_size=2 repl. pool on 8 OSD hosts with up to 2 replicas per host
Benchmark results - recordings
[four slides of recorded benchmark runs; images not reproduced]
Troubleshooting ceph

My experience so far can be summarized as follows: if a healthy Ceph cluster falls sick, it is almost certainly not caused by Ceph itself; otherwise it is due to misconfiguration, and one might have a problem that requires Ceph training to resolve. This matches the response I got from every Ceph admin/trainer I have met. It implies that in almost all cases Ceph troubleshooting is basically restricted to hardware health and can be done by staff without Ceph training. Once hardware failures are fixed or ruled out, the cluster usually heals itself. It is rather rare that one needs help from an experienced and/or trained person during ordinary operations.
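In practice the first-line checks are therefore generic health commands rather than anything requiring Ceph expertise, for example:

    ceph health detail             # what the cluster itself complains about
    ceph osd tree down             # down OSDs, grouped by host
    dmesg | grep -i 'I/O error'    # kernel-level disk errors on an OSD host
    smartctl -H /dev/sda           # SMART verdict for a suspect disk (device name varies)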
Troubleshooting ceph - is fun!
[three slides of examples; images not reproduced]
Best practices? Problems?

Typical recommendations and reality:
- ceph-ansible / ceph-deploy / ceph-container community edition / Red Hat Enterprise Storage
- the EC-profile min_size=k+1 mystery (see below)
- ceph and the laws of small numbers
- EC pools and on-storage compute
- EC pools vs. replicated pools: when and why
- hardware acquisition strategy
- which ceph version?
- a DeiC ceph admin group?
- partitioning of disks (containers, large logs)

Do not believe. Test as much as you can.
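On the min_size point: the value can at least be inspected and pinned per pool (pool name hypothetical); whether the default comes out as k or k+1 is exactly the mystery referred to above:

    ceph osd pool get ec82pool min_size     # inspect the current value
    ceph osd pool set ec82pool min_size 9   # pin to k+1 for an 8+2 profile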
Use the latest LTS version

http://docs.ceph.com/docs/mimic/install/get-packages/#add-ceph

The current LTS version is Luminous (12.2.8). We are currently on 12.2.7.
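The versions actually running can be compared across daemons before and after an upgrade (available from Luminous onwards):

    ceph versions   # per daemon type: which version each running daemon reports
    ceph -v         # version of the locally installed CLI/packages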