The All-Flash Array Built for the Next Generation Data Center

SolidFire and Ceph Architectural Comparison

July 2014
Overview

When comparing the architectures of Ceph and SolidFire, it is clear that both are scale-out storage systems designed to use commodity hardware, and the strengths of each make them complementary solutions for data center design. Ceph and SolidFire are both distributed clustered storage systems designed for scale. An object-based storage system at its core, Ceph employs four types of background services/daemons: cluster monitors, metadata servers, object storage daemons (OSDs), and a REST gateway. Clients communicate with the Ceph object cluster using the RADOS (Reliable Autonomic Distributed Object Store) protocol included in the Linux kernel. Ceph employs the RADOS block device (RBD) layer on top of the native object-storage infrastructure to facilitate its use as block storage. Unlike Ceph's object-storage framework, SolidFire's core storage system is a distributed, block-based (rather than object-based), content-addressable storage (CAS) system that distributes individual 4K blocks across a collection of storage devices. Clients communicate with a SolidFire cluster using a standard iSCSI or Fibre Channel connection. While Ceph has been designed primarily around large-scale, capacity-optimized use cases, SolidFire's architecture is purpose-built for performance-oriented block storage use cases and demonstrates significant benefits in these areas, specifically including:

Agility - SolidFire provides a best-in-class block storage solution for a wide range of performance-sensitive use cases, including private cloud, hybrid cloud, enterprise databases, and virtual desktop, along with the ability to adapt on the fly to a multiple-workload environment without affecting the performance of existing applications. Likewise, SolidFire's shared-nothing architecture allows for the addition or removal of nodes while maintaining application-specific QoS (max, min, and burst IOPS) settings.

Scalability - Both SolidFire and Ceph are scale-out architectures that allow for the addition of capacity and performance by adding nodes, but only SolidFire has the ability to adjust performance for a given data set after deployment.

Guaranteed performance - A key requirement of the next generation data center is an environment based on repeatable, predictable performance. Ceph offers the ability to scale out and tune performance, but does not have the ability to specify QoS for individual volumes. SolidFire enables enterprises to specify and guarantee minimum, maximum, and burst IOPS for individual storage volumes on the fly, independently of capacity, eliminating the noisy-neighbor problem in mixed-workload environments.

Architecture Review

Data Management

Ceph is a distributed clustered storage system that runs on top of Linux and uses commodity hardware. Clients communicate with a Ceph cluster using the RADOS protocol. Ceph's roots as a scalable object store are readily apparent in the architecture, including a focus on supporting large numbers of devices, client-directed data placement, and relatively basic data-path features.

Ceph's core storage system distributes objects across a collection of devices (OSDs, typically disk or SSD drives) using a load-balancing and mapping algorithm called CRUSH. On top of the core object-storage platform, Ceph has layered RBD and file-system (CephFS) interfaces. RBD is primarily implemented on the client side, using the object-storage layer as a base. Ceph also offers an HTTP object-storage interface, compatible with the S3 and Swift APIs, by leveraging a gateway service (RGW). The block-on-object data layout scheme does not allow for block-level de-duplication or compression, and the use of a standard file system prevents SSD-specific data-layout optimization. With this in mind, use cases like content distribution, where more copies aid in balancing load, play to the strengths of Ceph. Conversely, use cases where many instances access similar datasets, such as VDI, self-service cloud, development/testing, and transaction processing, become costly and deliver less than optimal performance when Ceph is employed for primary block storage.

Figure 1: On top of the core object-storage platform, Ceph has layered RADOS block-device (RBD) and file-system (CephFS) interfaces.

SolidFire is also a distributed clustered storage system that uses a Linux kernel and commodity hardware. The core storage system is a distributed, block-based, content-addressable storage system that distributes individual 4K blocks across a cluster of storage nodes. SolidFire uses a location-aware, bin-based data placement algorithm similar to that employed by Ceph, along with a second load-balancing algorithm that optimizes and adjusts client IO balancing between nodes based on QoS (quality-of-service) settings. Clients communicate with a SolidFire cluster using iSCSI and/or Fibre Channel connections. 2 solidfire.com
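The content-based, bin-oriented placement idea can be illustrated with a small sketch. This is an illustration only: the hash function, bin count, and bin-to-node assignment below are assumptions chosen for the example, not SolidFire's actual implementation.

```python
import hashlib

NUM_BINS = 4096  # hypothetical bin count; a real cluster chooses its own
NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster

def bin_for_block(block: bytes) -> int:
    """Placement follows from content, not logical address: hash the 4K
    block, then map the hash into a fixed set of placement bins."""
    digest = hashlib.sha256(block).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BINS

def node_for_bin(bin_id: int) -> str:
    """Bins are assigned to nodes; rebalancing moves whole bins between
    nodes rather than relocating individual blocks."""
    return NODES[bin_id % len(NODES)]

block = b"\x42" * 4096
# Identical content always maps to the same bin on the same node, which is
# what later makes in-line de-duplication a natural by-product.
assert bin_for_block(b"\x42" * 4096) == bin_for_block(block)
```

Because the mapping is deterministic, any node can compute where a block lives without consulting a central directory, which is what lets the load spread evenly regardless of IO size or locality.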
In the SolidFire architecture, writes are coalesced in NVRAM before being written to SSD storage. Data is written to SSDs using a custom log-structured data layout that minimizes write amplification (which leads to SSD wear-out) and maximizes performance. Data for an individual volume is broken into 4K blocks, which are distributed through the cluster based on content, leading to an even distribution of load for a volume regardless of IO size or locality. On top of the core CAS block-storage layer, SolidFire has a metadata layer that handles block-device abstraction and exposes standard block devices. SolidFire's core architecture is designed around space efficiency and delivering high levels of performance with quality-of-service (QoS) control. SolidFire's all-SSD architecture currently offers only a block-storage interface, but does so using standard block protocols and includes a rich set of data-path features, including in-line de-duplication, compression, and thin provisioning. Management of a SolidFire cluster is accomplished via a comprehensive REST-based API, a self-hosted highly available web interface, an integrated VMware vSphere plugin, and several other management-system plugins.

Scale and Efficiency

Ceph and SolidFire both utilize commodity hardware in scale-out architectures where capacity and performance increase in a linear fashion as nodes are added, but the similarities do not necessarily make them competing technologies. Ceph scales by adding object storage daemons (OSDs) and metadata servers and utilizing the CRUSH mapping and load-balancing algorithm. Read/write performance improvements can be achieved in Ceph by striping client data over multiple objects within an object set. Objects are mapped to different placement groups and further mapped to different OSDs, such that each write occurs in parallel at the maximum write speed. A write to a single disk would be limited by the bandwidth of that one device (e.g. 100 MB/s).
By spreading that write over multiple objects (which map to different placement groups and OSDs), Ceph can combine the throughput of multiple drives to achieve faster write (or read) speeds. It is important to note that once a striping schema has been implemented, its parameters cannot be changed, so performance should be tested thoroughly before putting it into production.

Figure 2: As SolidFire nodes are added to the cluster, their performance and capacity are aggregated into a pool available for assignment to LUNs/volumes.

Ceph's architecture is of significant value in large-scale, capacity-optimized storage, where performance is not a significant concern. Ceph's lack of compression and de-duplication, combined with its use of erasure coding for object storage, highlights it as a good choice for storage of large-scale data such as backups, images, video, content archives, and other use cases where performance, compression, and de-duplication are not primary drivers. Conversely, use cases involving mixed workloads, large numbers of cloned VMs, and database applications may struggle when deploying Ceph as primary block storage.

Central to the SolidFire architecture is the metadata abstraction layer that resides atop the distributed, native block-based, content-addressed storage (CAS) core. The metadata abstraction layer facilitates SolidFire's log-structured architecture for writing to disk. In this architecture, datasets of varying sizes are aggregated into larger segments and written down in a continuous linear fashion, much like a log file. When a drive has been fully written, disk space is recycled in a similar fashion: valid data is read in from segments that are partially empty (or that contain old, non-valid data), combined with the incoming data stream, and rewritten as a new segment.
In this way, SolidFire is able to efficiently manage writes to the SSDs and utilize less expensive cMLC drives while achieving enterprise-class data durability and consistent write performance. Like Ceph, SolidFire scales out performance and capacity by the addition of nodes. Unlike Ceph, SolidFire's clustered mesh architecture scales out such that capacity and performance can be provisioned independently, on the fly, at any time, to tune performance. As SolidFire nodes are added to the cluster, their performance and capacity are aggregated into a pool available for assignment to LUNs/volumes. Uniquely, SolidFire enables enterprises to specify and guarantee minimum, maximum, and burst IOPS for individual storage volumes dynamically and independently of capacity.
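The log-structured write-and-recycle cycle described above can be sketched as follows. The segment size and bookkeeping here are deliberately toy-scale assumptions that illustrate the technique, not SolidFire's on-disk format.

```python
SEGMENT_SIZE = 4  # entries per segment (toy scale; real segments are far larger)

class LogStructuredStore:
    def __init__(self):
        self.segments = []  # sealed segments: lists of (block_id, data)
        self.open_seg = []  # the segment currently being filled
        self.live = {}      # block_id -> (segment index, slot) of newest copy

    def write(self, block_id, data):
        """Every write appends to the open segment (sequential on media)."""
        self.live[block_id] = (len(self.segments), len(self.open_seg))
        self.open_seg.append((block_id, data))
        if len(self.open_seg) == SEGMENT_SIZE:
            self.segments.append(self.open_seg)  # seal: one sequential flush
            self.open_seg = []

    def recycle(self, seg_idx):
        """Reclaim a sealed segment: carry its still-live blocks forward into
        the incoming write stream, then mark the old space reusable."""
        for slot, (block_id, data) in enumerate(self.segments[seg_idx]):
            if self.live.get(block_id) == (seg_idx, slot):  # skip stale copies
                self.write(block_id, data)
        self.segments[seg_idx] = None  # space is free for new segments

store = LogStructuredStore()
for bid in "abcd":
    store.write(bid, bid.encode())   # fills and seals segment 0
store.write("a", b"A2")              # overwrites make old copies stale
store.write("b", b"B2")
store.recycle(0)                     # only "c" and "d" are copied forward
assert store.segments[0] is None
```

Because overwrites never touch old data in place, the SSD only ever sees large sequential writes, which is what keeps write amplification low.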
Data for an RBD block device is striped across multiple objects (on multiple OSDs) to spread the data and load across the cluster. By default, each 4MB range of blocks is placed in a new object, so access within a 4MB range will be bound to a single device, while access across a wider range will utilize multiple devices. Other striping options can be configured. A Ceph cluster also requires a cluster of monitors that store and coordinate changes to cluster-wide configuration and mapping. Clients contact the monitors on startup to obtain a map of the cluster, and then connect to OSD servers directly to perform object IO. Management of a Ceph cluster is handled via a Linux-only command-line interface or a Linux-based C library; RBD can also be managed with a Linux-based Python wrapper API.

Figure 3: Data is written to SSDs using a custom log-structured data layout that minimizes write amplification (which leads to SSD wear-out) and maximizes performance.

Data protection in SolidFire is handled via a distributed-replication scheme that distributes redundant copies of data blocks and metadata throughout the cluster, but does not use 1:1 mirroring of drives or nodes. On device failure, rebuilds occur in a meshed fashion between all nodes and drives in the cluster to restore redundancy into existing free space. In addition to log-structured writing, content addressing is a vital element of the architecture's in-line efficiency. Addressing blocks by content rather than by location (e.g. by LBA) means that SolidFire's core architecture, by its very nature, facilitates in-line data reduction. Because data is addressed (and referenced) by its content, de-duplication of incoming data becomes a natural and easily handled process.
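A minimal sketch shows why content addressing makes de-duplication natural: the block's hash is its identity, so a duplicate write is detected by a simple lookup rather than a separate scan. The hash choice and data structures are illustrative assumptions.

```python
import hashlib

class ContentAddressedStore:
    """Toy CAS: blocks are keyed by a hash of their content, so identical
    4K blocks are physically stored once and reference-counted."""
    def __init__(self):
        self.blocks = {}    # content hash -> block data (stored once)
        self.refcount = {}  # content hash -> number of logical references

    def write(self, block: bytes) -> str:
        bid = hashlib.sha256(block).hexdigest()
        if bid in self.blocks:
            self.refcount[bid] += 1   # duplicate content: just add a reference
        else:
            self.blocks[bid] = block  # new content: store it once
            self.refcount[bid] = 1
        return bid

store = ContentAddressedStore()
a = store.write(b"\x00" * 4096)
b = store.write(b"\x00" * 4096)  # identical content, de-duplicated in-line
assert a == b
assert len(store.blocks) == 1 and store.refcount[a] == 2
```

In a location-addressed layout, two identical blocks at different LBAs would occupy two physical slots; here they collapse to one by construction.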
Data Consistency and Power-Loss Handling

Data protection in Ceph is handled via replication or erasure coding. Replication makes copies of objects on multiple OSDs to protect against drive or storage-node failures. Erasure coding uses a multiple-parity scheme to protect objects with less overhead (but more computational cost); however, it only allows full-object writes and is not compatible with block or file access. Ceph data consistency and power-loss protection are achieved with commits to local log devices on the storage nodes. Typically, data is sent from the client to a primary OSD, written to a local log device, and replicated to one or more secondary/tertiary OSDs and written to their log devices, before the write is acknowledged to the client. Data is asynchronously flushed from memory to the OSD backing storage. OSD backing storage sits on top of a standard filesystem (typically XFS or EXT4) on the device, with objects stored as files on the device.

SolidFire data consistency and power-loss protection are achieved through the use of high-speed PCI-based NVRAM devices. Writes are replicated to multiple NVRAM locations in the cluster before being acknowledged to the client. All data is checksummed when stored in NVRAM or on SSD, and checksums are verified on reads or in the background. Any checksum failure causes the data to be re-read from an alternate location and repaired. With this architectural implementation, SolidFire is able to complete drive-failure rebuilds in minutes versus hours with Ceph, making SolidFire ideal for mission-critical business applications.

QoS

To deliver predictable and guaranteed storage performance, SolidFire leverages QoS performance virtualization of resources. Patented by SolidFire, this technology enables the management of storage performance independent of storage capacity. The SolidFire architecture allows users to set minimum, maximum, and burst IOPS on a per-volume basis.
Because performance and capacity are managed independently, SolidFire clusters are able to deliver predictable storage performance to thousands of applications within a shared infrastructure.
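The per-volume min/max/burst controls can be approximated with a credit-based limiter. This sketch is a deliberate simplification: real SolidFire QoS arbitrates across the whole cluster, with min IOPS acting as a floor under contention; here only the per-second max/burst accounting is modeled, and all numbers are illustrative.

```python
class VolumeQoS:
    """Toy per-volume IOPS limiter in the spirit of min/max/burst QoS."""
    def __init__(self, min_iops, max_iops, burst_iops):
        self.min_iops = min_iops      # contention floor (not modeled here)
        self.max_iops = max_iops      # sustained per-second ceiling
        self.burst_iops = burst_iops  # short-term ceiling funded by credits
        self.credits = 0              # banked headroom from idle seconds

    def allowance(self, demanded: int) -> int:
        """IOs admitted this second for a given demand."""
        allowed = min(demanded, self.max_iops + self.credits, self.burst_iops)
        if demanded < self.max_iops:
            # Idle headroom accrues as burst credits, capped so that
            # max + credits can never exceed the burst ceiling.
            self.credits = min(self.credits + (self.max_iops - demanded),
                               self.burst_iops - self.max_iops)
        else:
            # Bursting spends the banked credits.
            self.credits = max(0, self.credits - (allowed - self.max_iops))
        return allowed

vol = VolumeQoS(min_iops=500, max_iops=1000, burst_iops=2000)
assert vol.allowance(300) == 300    # light second banks 700 credits
assert vol.allowance(5000) == 1700  # burst: 1000 max + 700 banked credits
assert vol.allowance(5000) == 1000  # credits spent; back to the max ceiling
```

The point of the scheme is that a volume's ceiling is a policy knob, independent of how much capacity it consumes, so one noisy volume cannot starve its neighbors.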
The Bottom Line

SolidFire and Ceph are both scale-out, distributed storage systems that run on commodity hardware. While SolidFire and Ceph share several architectural elements, other architecture choices, as well as implementation details, provide a clearer picture of the ideal use cases for each.

Figure 4: SolidFire leverages QoS performance virtualization of resources. Patented by SolidFire, this technology enables the management of storage performance independent of storage capacity.

As a scale-out storage architecture, Ceph allows for the scaling out of performance and capacity, as well as load-balancing functionality, but lacks any performance or quality-of-service (QoS) optimizations and does not offer any built-in data-reduction capabilities. For dedicated application environments, Ceph works well. However, without QoS control, highly virtualized, multiple-workload deployments (e.g. VDI, private cloud, IT consolidation) that utilize Ceph will exhibit the noisy-neighbor problem. Ceph is well suited for large-capacity, low-cost, disk-based object storage.

SolidFire's Guaranteed QoS architecture allows performance to be provisioned, managed, and guaranteed on a per-application basis, eliminating the impact of noisy neighbors and minimizing the impact of failures. SolidFire's shared-nothing scale-out architecture makes it ideally suited for large-scale, mixed-workload enterprise and service provider deployments. The ability to mix multiple models of nodes within clusters, scaling out performance and capacity linearly at any time (not just at initial deployment), combined with the ability to guarantee IOPS per volume, means deployments can start and grow as needed without disruption to running applications or worry of stranding either performance or capacity.
Adding all of this up, SolidFire's architecture means organizations can consolidate multiple applications and workloads onto an agile, scalable, predictable, automated infrastructure, saving space, time, and resources, and ultimately contributing to the bottom line. In the end, both capacity-optimized storage systems like Ceph and performance-optimized systems like SolidFire have a place in most data centers. Furthermore, the two can be integrated via SolidFire's API-based backup and restore feature for backup or offload to object storage.