A Field's Introduction to SUSE Enterprise Storage TUT91098 Robert Grosschopff, Senior Systems Engineer, robert.grosschopff@suse.com Martin Weiss, Senior Consultant, martin.weiss@suse.com Joao Luis, Senior Software Engineer, joao@suse.com
SUSE Enterprise Storage Introduction and Overview (The Pre Sales Phase)
What is SUSE Enterprise Storage? Based on Ceph
2003: Sage Weil's PhD thesis at UCSC
2006: Open sourced
2007-2011: Incubated by DreamHost
2012: Inktank was founded
2013: SUSE announces plans for Ceph
2014: Red Hat acquires Inktank
2015: SUSE Enterprise Storage 1.0
November 2015: SUSE Enterprise Storage 2.0
January 2016: SUSE Enterprise Storage 2.1
July 2016: SUSE Enterprise Storage 3.0
End of 2016: SUSE Enterprise Storage 4.0
Design Criteria for Ceph Development / Architecture Fault Tolerance (No Single Point Of Failure) Scalability Performance Automated Management Self Managing Self Healing Flexibility (100% Software Based) Multiple Access Protocols Runs on Commodity Hardware
Commodity Hardware Suitable Server? - Well, it depends
Expanding Storage Suitable JBODs? - Again, it depends
Acronyms
RADOS: Reliable Autonomic Distributed Object Store
CRUSH: Controlled Replication Under Scalable Hashing
RBD: RADOS Block Device
RGW: RADOS Gateway
CephFS: Ceph Filesystem
OSD: Object Storage Daemon
MON: Ceph Monitor
MDS: Metadata Server
Design Principles for Implementation
Fault tolerance: infrastructure, MONs, OSDs, MDSs, gateways
Redundancy vs space efficiency: replication (size), erasure coding (K+M), configurable redundancy (see the pool example below)
Location awareness, data distribution
Performance: bandwidth, latency, IOPS
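A hedged sketch of the two redundancy schemes from the command line (pool names and PG counts are illustrative, not recommendations):

ceph osd pool create replpool 128 128 replicated
ceph osd pool set replpool size 3                          # 3 full copies: simple, but 200% capacity overhead
ceph osd erasure-code-profile set myprofile k=2 m=1
ceph osd pool create ecpool 128 128 erasure myprofile      # 2 data + 1 coding chunk: only 50% overhead, more CPU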
SUSE Enterprise Storage Components MON, OSD, MDS, Gateways
Ceph - Components
RGW: A web services gateway for object storage, compatible with S3 and Swift
RBD: A reliable, fully distributed block device with cloud platform integration
CephFS: A distributed file system with POSIX semantics and scale-out metadata management
LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
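A small illustration of direct RADOS access, using the rados command-line tool rather than librados; the pool name replpool and the object name are placeholders:

echo "hello ceph" > /tmp/hello.txt
rados -p replpool put hello-object /tmp/hello.txt   # store an object
rados -p replpool ls                                # list objects in the pool
rados -p replpool get hello-object /tmp/out.txt     # read it back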
Use Cases
Cloud (OpenStack) storage backend: Cinder / Glance
Object store: RADOS, S3 and Swift compatible
Block device: Linux clients (RBD), hypervisors (e.g. KVM / QEMU)
iSCSI: VMware ESXi, Linux, Windows and other iSCSI-based clients
CephFS (distributed scale-out file system): Linux clients, NFS Ganesha (future), Samba (future)
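A minimal RBD sketch for the Linux client case listed above (image and pool names are placeholders):

rbd create --pool rbd --size 10240 myimage   # 10 GiB image
rbd map rbd/myimage                          # appears on the client, e.g. as /dev/rbd0
mkfs.xfs /dev/rbd0                           # then use it like any other block device
rbd showmapped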
Challenges and Expectation Setting Customers need to understand the latency challenge of scale-out storage. Small systems do NOT perform! Do not start too small, and increase size and load over time. Implement SLOWLY and increase the load SLOWLY! Give the system time to burn in, and test.
SES Field Experiences (The Post Sales Phase)
The 7 Ps and what to do Proper Prior Planning Prevents P*** Poor Performance ;-)
Consulting Approach
Analysis of the current environment: time synchronization and name resolution, network infrastructure, storage infrastructure and clients
Analysis of the requirements and the use case: RTO / RPO / SLA, fault tolerance, performance and scalability
Design the solution and select the hardware: number of servers, number of disks, memory, CPU, networking
Implement the solution in an automated and repeatable fashion: AutoYaST, SMT / SUSE Manager, configuration management
Know-how transfer: the customer needs to understand and operate the cluster
Test, test, test... and test! Performance / latency / bandwidth; bottom-up testing of disk, file system, journal, OSD bench and RADOS bench (example commands below); fault tolerance
No cluster that has not been tested will work as expected or required!
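To illustrate the bottom-up testing above, a minimal sketch of the usual layers; the mount point, pool name testpool and OSD id are placeholders, and the dd target must be an empty test location, never a path that already holds data:

dd if=/dev/zero of=/mnt/osd-test/ddfile bs=4M count=256 oflag=direct   # file system write throughput on one disk
ceph tell osd.0 bench                                                  # built-in single-OSD write benchmark
rados bench -p testpool 60 write --no-cleanup                          # cluster-wide object write test
rados bench -p testpool 60 seq                                         # sequential read test against the objects written above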
Design Approaches (1) Server Hardware SLES certified! RAID 1 for the OS JBOD for OSD disks Sufficient Memory + CPU
Design Approaches (2) Performance and Size
Disk / RAID controller / HBA: RAID controllers for OSD disks do not make much sense; variations in performance with cache on the HBA
Journal on SSD or on the same disk
Number of nodes and disks, size of disks, type of disks (density)
IOPS per disk / 4 when the journal sits on the same disk (journal in & journal out, XFS journal, XFS data) - see the worked example below
Number of disks / number of OSD hosts? Impact of failures
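A rough worked example under assumed numbers: a 7.2k rpm SATA disk delivering about 150 IOPS with a co-located journal yields roughly 150 / 4 ≈ 37 client write IOPS per OSD; a cluster of 4 nodes with 8 such disks each therefore offers on the order of 32 × 37 ≈ 1,200 client write IOPS before replication overhead. The disk figures are assumptions for sizing only, not measurements.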
Design Approaches (3) Network
No SPOF - Ceph is network sensitive!
Optimize for bandwidth and low latency
Separation of networks - two or more networks!
iSCSI on a separate network, multipathing instead of bonding
Connect all primary NICs to the same switch
When possible use 802.3ad
Configure jumbo frames (see the example below)
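A minimal sketch of network separation and jumbo frames; the interface name and subnets are placeholders:

ip link set dev eth0 mtu 9000        # jumbo frames on the storage-facing NIC (the switch ports must match)
# /etc/ceph/ceph.conf - separate public and cluster networks
[global]
public network = 192.168.10.0/24
cluster network = 192.168.20.0/24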
Design Approaches (4)
Creating pools: size, backfill, scrubbing
Using cache tiers: persistent read cache, fast writes, but more complexity (example commands below)
Erasure coding and RBD
Placing and using gateways (iSCSI, RGW): same node or a different node (a separate node is better but more expensive)
Access methods: RADOS, RGW, iSCSI, CephFS
MON, OSD and rack hardware (and disk) placement
Fault tolerance: single datacenter, multiple datacenters (2+1), size, scrubbing, backfill
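A hedged sketch of putting a replicated cache tier in front of an erasure-coded pool (pool names, PG counts and the size limit are illustrative; in Ceph releases of this era RBD cannot write directly to an erasure-coded pool, which is one reason to front it with a cache tier):

ceph osd pool create cachepool 128                            # replicated pool on fast media
ceph osd tier add ecpool cachepool                            # attach it to the erasure-coded pool created earlier
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay ecpool cachepool                    # clients now transparently hit the cache first
ceph osd pool set cachepool target_max_bytes 107374182400     # ~100 GB of cache before flushing/eviction kicks in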
Real Customer Setups (1)
Many customers start with 4 to 9 or more servers (1 / 2 / 3 data centers): 3 MONs, 4 OSD nodes, additional gateways
The more disks and the more servers, the better!
Min. 8 disks per server, min. 32 disks in the cluster (4 nodes with 8 disks each)
Min. 1-2 GB RAM per TB of disk space - memory can only be replaced with more memory! More for Erasure Coding and Cache Tiering! More memory for recovery if there is a low number of OSDs / disks in a server
1.5 GHz of CPU per OSD disk - more CPU for Erasure Coding, more CPU for recovery (worked sizing example below)
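A rough sizing illustration under assumed disk sizes: 4 nodes with 8 × 4 TB disks each means 32 TB of raw capacity per node, so the 1-2 GB RAM per TB rule calls for roughly 32-64 GB of RAM per node, and 8 OSDs × 1.5 GHz suggests about 12 GHz of aggregate CPU per node (e.g. a 6-core 2 GHz CPU). The 4 TB disk size is an assumption for the arithmetic, not a recommendation.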
Real Customer Setups (2)
Pure SATA: 4 nodes with 16 SATA disks = 64 SATA disks, backup solution, one pool with size = 2 and another pool with size = 3, 1 Gbit network only
Mixed SSD and SATA: 4 nodes with 4 * SSD + 4 * SATA disks each = 16 * SSD + 16 * SATA
Scenario 1: cache tier with 16 * SSD + EC pool 2:1 with 16 * SATA
Scenario 2: SSD pool with 16 * SSD + SATA pool with 16 * SATA
Scenario 3: SSD pool with 8 * SSD + SATA pool with 16 * SATA, with journal on SSD at a 2:1 ratio
10 Gbit network and iSCSI
Real Customer Setups (3)
HBA / RAID controller: ensure compatibility between disks and controller; use JBOD and do not use the RAID controller cache; different enterprise SSDs showed performance loss after some time, and disconnects
Mainboard setup: upgrade all firmware to the latest version, disable all components that are not required, turn off all power-saving features
Implementation Best Practices
Concept before implementation
Ensure repeatable installation with staging: AutoYaST, SMT / SUSE Manager (staging!), Salt / Chef / Crowbar / DeepSea, configuration management
The latest supported drivers should be installed
Btrfs for the OS, XFS for data / OSDs / MONs
Fault tolerance testing, performance testing, tuning, documentation
Operating and Troubleshooting Common Issues (The Usage and Post Implementation Phase)
Nothing is sexier than a Healthy Cluster
Example of a healthy cluster: the output of ceph -s / ceph health
This is what you always want your cluster to look like.
Sometimes it will not be, and you should know whether what you are looking at is a problem or normal operation.
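A minimal sketch of the commands used to confirm that state (no output reproduced here, since it varies by cluster and release):

ceph health        # HEALTH_OK on a healthy cluster
ceph -s            # status summary: MON quorum, OSDs up/in, PGs active+clean
ceph osd tree      # every OSD up and weighted as expected
ceph -w            # watch the cluster log live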
Common Warnings
Monitors: clocks are skewed, a monitor is down
OSDs: an OSD is down, flag x is set (e.g., noup, noout, nobackfill, noscrub) - see the flag example below
PGs: PGs are degraded, PGs are undersized, PGs are stuck (unclean, inactive, degraded, undersized, stale), an operation is blocked for x seconds
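Flags are often set deliberately; a hedged example of the usual maintenance pattern:

ceph osd set noout        # keep OSDs from being marked out during a planned reboot
# ... service or reboot the node ...
ceph osd unset noout      # return to normal behaviour
ceph health detail        # shows which flags are set and which PGs are affected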
Less Common Warnings
Monitors: the data directory is getting full
OSDs: an OSD is near full, a pool is full, a pool is over its object quota threshold, a pool is over its byte quota threshold
The scary HEALTH_ERR
Monitors: the monitor data directory has no available space
OSDs: no OSDs, an OSD is full, a pool is critically over its quota for max objects, a pool is critically over its quota for max bytes
PGs: stuck for more than x seconds
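Useful commands when chasing full OSDs and pool quotas (the pool name rbd and the quota value are placeholders):

ceph df detail                                        # per-pool usage including quotas
ceph osd df                                           # per-OSD utilization, finds full / near-full OSDs
ceph osd pool set-quota rbd max_bytes 1099511627776   # example: 1 TiB byte quota on a pool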
What may have caused this? Were there power issues? Are you suffering from network issues? Did the hardware change? Did you add new nodes? Did you remove nodes? Were there any configuration changes? CRUSH map?
Usual suspects - OSDs
Down because they cannot be started
Down because they die: too many OSDs per server for the available RAM? catastrophic network issues that lead to OOM?
Down / out because of osdmap flags? noup, noin?
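A few commands that usually narrow down why an OSD is down (the OSD id 12 is a placeholder, and the systemd unit name may differ between versions):

ceph osd tree                                 # which OSDs are down or out, and where they sit in the CRUSH tree
ceph osd dump | grep flags                    # osdmap flags such as noup / noin
systemctl status ceph-osd@12                  # state of a single OSD daemon on systemd-managed nodes
tail -n 100 /var/log/ceph/ceph-osd.12.log     # last lines before the daemon died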
Usual suspects - PGs
Not being remapped: CRUSH rules, OSDs down, osdmap flags
Operations are stuck: OSDs cannot talk to each other, OSDs are overloaded, osdmap flags, CRUSH rules
Down: OSDs are down
Degraded: OSDs are down, CRUSH rules
Incomplete: OSDs are down
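Commands that usually reveal which PGs are unhappy and why (the PG id 2.1f is a placeholder):

ceph health detail          # lists the problematic PGs and the reason
ceph pg dump_stuck unclean  # stuck PGs; also accepts inactive, stale, undersized, degraded
ceph pg 2.1f query          # full peering and recovery state of a single PG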
When all else fails: contacting Support
What we will need for a speedy resolution:
What changed in the system since the last healthy state
Supportconfig and logs from all the affected nodes
Logs from monitors and affected OSDs, at appropriate debug levels, otherwise logs are close to useless!
Set debug levels either via injectargs, the admin socket or ceph.conf:
ceph tell osd.1 injectargs '--debug-osd 10'
ceph daemon osd.1 config set debug_osd 10
Debug levels
Monitor: debug ms = 1, debug mon = 10, debug paxos = 10
OSDs: debug ms = 1, debug osd = 10, debug filestore = 10, debug journal = 10, debug monc = 10
e.g.,
ceph tell mon.* injectargs '--debug-ms 1 --debug-mon 10'
ceph tell osd.* injectargs '--debug-ms 1 --debug-osd 10'