Open vStorage - EMC ScaleIO Architectural Comparison

Open vStorage is the world's fastest distributed block store that spans across different datacenters. It combines ultra-high performance and low-latency connections with unmatched data integrity. Data is distributed across datacenters using both replication and erasure coding. Joining performance and integrity is not a simple bolt-on solution and requires a from-the-ground-up approach. Disk failures, node failures and even datacenter failures do not cause data loss and hence do not threaten your data integrity. You have been led to believe that in order to have 100% data loss protection you have to compromise on performance. While this might sound logical and acceptable, it is time to step out of the box and demand a no-compromise storage platform. With Open vStorage you can have your cake and eat it too! This document provides an overview of the ScaleIO architecture and highlights the differences with the Open vStorage architecture. The comparison is not intended to be exhaustive, but covers the most relevant items where both solutions differ as seen from a customer perspective.

Antwerpse Steenweg 19, 9080 Lochristi, Belgium - Phone: +32 9 324 25 74 - Mail: Info@openvstorage.com
Introduction

ScaleIO is EMC's software-defined, scale-out, block storage solution and is designed for large-scale datacenters. It combines multiple x86 storage nodes into a storage cluster targeted at running high-bandwidth, low-latency IO workloads. ScaleIO presents top-class performance results. However, to reach these performance numbers, some trade-offs had to be made in the design. This has led to certain limitations with regard to functionality and reliability. Open vStorage takes a different approach which not only results in superior performance vs. ScaleIO, but also offers more functionality.

Architectural Design

ScaleIO

The basic components of ScaleIO1 are the ScaleIO Data Client (SDC) and the ScaleIO Data Server (SDS). The SDC is a lightweight block device driver that exposes local block volumes to applications running on the same server. The actual data is stored on storage nodes that run the SDS. The SDS manages the local storage devices (HDDs, SSDs, PCIe flash cards, ...) and contributes these devices to the global storage pool. The role of the SDS is to perform the backend IO operations as requested by an SDC. Each ScaleIO volume is divided into 1 MB chunks. These chunks are distributed (striped) across physical disks throughout the cluster. Each chunk has 2 copies for redundancy reasons. Although chunks are 1 MB in size, ScaleIO allows reading or writing, for example, 4K instead of the full 1 MB.

1 http://www.cisco.com/c/en/us/solutions/collateral/data-center-virtualization/solutions-vspex/whitepaperc11-733544.html
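The chunking and striping scheme described above can be sketched in a few lines. This is a minimal illustration only: ScaleIO's actual placement algorithm is proprietary, and the round-robin layout and all names below are assumptions made purely for clarity.

```python
CHUNK_SIZE = 1 << 20  # ScaleIO divides each volume into 1 MB chunks


def chunk_placement(volume_size, num_disks):
    """Map each 1 MB chunk of a volume to a (primary, secondary) disk pair.

    Round-robin placement is an illustrative simplification; the real
    distribution algorithm is not public. The second copy is simply
    forced onto a different disk for redundancy.
    """
    num_chunks = (volume_size + CHUNK_SIZE - 1) // CHUNK_SIZE
    placement = {}
    for chunk in range(num_chunks):
        primary = chunk % num_disks
        secondary = (chunk + 1) % num_disks  # always a different disk
        placement[chunk] = (primary, secondary)
    return placement


# An 8 MB volume striped across 4 disks: 8 chunks, each with 2 copies.
layout = chunk_placement(8 * CHUNK_SIZE, 4)
```

Because consecutive chunks land on different primary disks, reads and writes for a single volume are spread across the whole cluster, which is what makes the linear scaling described later possible.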
[Figure: ScaleIO write and read path. Source: EMC2]

On a write, the SDC that exposes the ScaleIO volume sends the IO to the primary SDS where the chunk is located. The primary SDS sends the IO to the local drive and in parallel to the secondary SDS which holds the second copy of the chunk. Only after an acknowledgment is received from the secondary SDS does the primary SDS acknowledge the write to the SDC. On reads, the SDC connects to the primary SDS to fetch the data it needs.

2 https://www.emc.com/collateral/white-papers/h14344-emc-scaleio-basic-architecture.pdf
Open vStorage

The basic components of Open vStorage are the Open vStorage Edge, the Open vStorage Volume Driver and ALBA. The Open vStorage Edge exposes block devices to applications that need a volume. The Edge component communicates via RDMA, a low-latency, high-throughput networking protocol, directly with the memory of the server running the Volume Driver. The Volume Driver is the technology that converts block storage into objects (Storage Container Objects, SCOs), which can be stored on the ALBA backend. This ALBA backend is a special-purpose object storage solution and is made up of storage nodes running ALBA daemons. These daemons manage the local storage devices (HDDs, SSDs, PCIe flash cards, ...) and contribute these devices to the storage pools. The Volume Driver combines a location-based approach (delivering performance) with a log-structured approach (delivering unlimited history, unlimited snapshots and thin cloning). Each incoming 4K write is appended to the write buffer. This write buffer can be seen as a transaction log storing one or more Storage Container Objects, each a consecutive group of incoming writes. The incoming write is also dispatched to a transaction log on a second node in the cluster to prevent data loss. This principle of dispatching incoming writes to an additional write buffer is referred to as the Open vStorage Distributed Transaction Log (DTL). It is important to note that both the write buffer and the DTL are very small in size as they only need to hold data which is not yet protected by the ALBA backend. They are typically limited to 256 MB per volume.
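The append-only write path just described can be modeled as a toy sketch. Everything below (class and method names, the SCO size, the in-memory hand-off) is hypothetical and greatly simplified compared to the real Volume Driver; it only illustrates the log-structured principle of appending 4K writes into SCOs.

```python
SCO_SIZE = 64 * (1 << 20)  # illustrative SCO size; the real size is configurable
BLOCK = 4096               # incoming writes arrive as 4K blocks


class WriteBuffer:
    """Toy model of the write path: 4K writes are appended to the
    current SCO; a full SCO would be handed off to the ALBA backend.
    These names are illustrative, not Open vStorage's actual API."""

    def __init__(self):
        self.current_sco = []  # open SCO being filled
        self.full_scos = []    # closed SCOs ready for the backend

    def write(self, lba, data):
        assert len(data) == BLOCK
        # Log-structured: never overwrite in place, always append.
        self.current_sco.append((lba, data))
        # In the real system this write is also dispatched to the DTL
        # on a second node before the write is acknowledged.
        if len(self.current_sco) * BLOCK >= SCO_SIZE:
            self.full_scos.append(self.current_sco)
            self.current_sco = []


buf = WriteBuffer()
for i in range(5):
    buf.write(i, b"\x00" * BLOCK)
```

Note that two writes to the same LBA produce two appended entries rather than an in-place update, which is exactly why snapshots and history come almost for free in this design.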
Once a SCO is full, it gets split into chunks, compressed, encrypted and spread across the ALBA backend for redundancy. This ALBA backend is typically built on top of a pool of large-capacity drives with an all-flash performance tier as acceleration layer. On reads the Volume Driver fetches the correct block directly from the right chunk on the ALBA backend via RDMA.

Similarities

Software-defined storage

Both ScaleIO and Open vStorage are software-defined storage solutions, which means they are infrastructure agnostic and can run on any x86 hardware. The actual hardware on which the software runs of course has a huge impact on the performance numbers you can achieve.

Scalability

ScaleIO is designed to massively scale from 3 nodes up to thousands of nodes. Unlike most traditional storage systems (SANs), performance and throughput scale linearly with every node added. Every storage node added is used to process I/O operations as IO requests are dispersed across the nodes. The same applies to Open vStorage. It can grow as big as 1024 nodes in a cluster, and performance and capacity scale with every node added. Next to performance, both solutions are built to scale capacity-wise from a couple of TB up to a few petabytes. For example, Open vStorage can scale to 30 PB in one cluster. Each storage node which gets added to the cluster puts the capacity of the physical storage devices it controls at the disposal of the cluster-wide pool of storage.
Key Differences

Huge Performance Difference

The performance of ScaleIO is well documented and can reach 200-250K IOPS per node3 for 4K random reads and writes. While this is impressive, Open vStorage typically offers around 500K IOPS for random reads per node, for example on a Cisco UCS server with 2 Intel NVMe drives. This means Open vStorage is 2 to 2.5 times faster than what is typically already considered impressive storage performance. This huge performance difference can be explained by the fact that both solutions have radically different designs. ScaleIO uses a location-based approach where each volume is divided into 1 MB fragments. Open vStorage uses a log-structured approach and uses RDMA to bypass the kernel and file system as much as possible. This approach leads to lower latency and hence better performance.

Data safety

ScaleIO uses a 2-copy replication strategy for every bit of data to safeguard against node failures. This 2-way replication strategy protects the user against a single disk failure. With growing disk drive capacity, this strategy will inevitably lead to data loss in large clusters. When a large-capacity disk fails, it takes quite some time to rebuild the data from the dead device on other devices in the cluster. During this time the data which was on the dead device is extremely vulnerable to data loss, as there is only a single copy remaining in the cluster. A second disk failure might already result in data loss, but even rebooting the wrong node at the wrong time leads to data unavailability. Also, with 10 TB drives and an Unrecoverable Bit Error Rate (BER) of 1x10^-14, it is almost assured that some data on the disk can't be read. Basically, a 2-copy strategy is not a safe approach when storing large amounts of data and will lead to data loss. Open vStorage uses a different approach, which can be compared to solving a Sudoku puzzle.
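Before looking at that approach, the bit-error claim above can be verified with quick back-of-the-envelope arithmetic: reading all of a 10 TB drive at a BER of 1x10^-14 yields 0.8 expected unreadable bits, i.e. roughly even odds (under a Poisson approximation) that a full rebuild read hits at least one error.

```python
import math

drive_bytes = 10e12        # 10 TB drive
bits = drive_bytes * 8     # 8e13 bits to read during a full rebuild
ber = 1e-14                # unrecoverable bit error rate from the spec sheet

# Expected number of unreadable bits when reading the entire drive.
expected_errors = bits * ber  # 0.8

# Probability of at least one unrecoverable error (Poisson approximation).
p_at_least_one = 1 - math.exp(-expected_errors)  # roughly 0.55
```

So with 2-way replication, a rebuild that depends on reading one surviving 10 TB copy end-to-end has better-than-even odds of encountering an unreadable sector, which is the basis for the data-loss argument in the text.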
Each SCO, a collection of consecutive writes, is chopped up into chunks and some additional parity chunks are added. All these chunks are distributed across all the nodes and datacenters in the cluster. The total number of chunks can be configured and allows, for example, recovery from a multi-node failure or a complete datacenter loss. A failure, whether it is a disk, node or datacenter, will cross out some numbers from the complete Sudoku puzzle, but as long as you have enough numbers left, you can still solve the puzzle. The same goes for data stored with Open vStorage: as long as you have enough chunks (disks, nodes or datacenters) left, you can always recover the data. Basically, with Open vStorage your data is truly safe.

Encryption

Another area where Open vStorage is a step ahead on data safety is the security of the data on the physical disks. While ScaleIO stores the data in the clear, Open vStorage uses AES 256-bit encryption when storing data. Open vStorage even allows using a different encryption key per volume and can be integrated with different third-party key management tools.

Snapshots and Clones

ScaleIO has limited the number of snapshots and clones which can be taken from a volume to 31 instances only. As ScaleIO uses a location-based approach, it needs to safeguard the old data before overwriting data at a certain location. Keeping track of this old data is complex and slows down performance when snapshotting and as snapshots increase in number. Hence the decision to limit the number of possible snapshots and clones for a single volume. Being able to create only 31 clones severely limits ScaleIO as a backend for e.g. VDI implementations. Open vStorage on the other hand uses a log-structured approach on the backend where data is never overwritten but always appended. This means a snapshot for Open vStorage is a quick and low-cost operation, just placing a marker behind the latest write. As snapshots are lightweight, you can create an unlimited number of snapshots per volume. Snapshots can also be used to create clones. The clones are zero-copy and share the original data with the parent.

3 https://www.cloudscaling.com/assets/pdf/h14196-esg-lab-spotlight-proven-performance-and-scalabilitywp.pdf
As an unlimited number of clones can be made, Open vStorage provides copy data virtualization (similar to Actifio or Cohesity) out of the box, without the need for separate backup and copy data virtualization software.

Space efficiency

As discussed earlier, ScaleIO uses a 2-way replication strategy. This isn't very space efficient, as every 1 MB will lead to 2 MB being stored on the physical disks. This means double the number of nodes, double the amount of networking equipment, double the amount of power and cooling: basically double the TCO. Open vStorage is much more space efficient as it uses forward error correction, which provides 100x better reliability compared to 2 copies while writing only 1.25x more data. This means that for every 1 MB, only 1.25 MB will be stored on the backend. Secondly, Open vStorage allows for the creation of different error correction policies for flash and HDD, so the customer can select a lower redundancy factor on flash and a higher redundancy factor on HDD. In addition, contrary to ScaleIO, Open vStorage can further reduce its storage footprint by compressing the fragments before storing them on SSDs and HDDs.
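The 1.25x figure can be reproduced with simple arithmetic. A forward-error-correction policy stores k data fragments plus m parity fragments, so the raw-to-usable ratio is (k+m)/k. The 8+2 policy below is a hypothetical example chosen because it matches the 1.25x figure in the text; the actual policy is configurable and not stated here.

```python
def storage_overhead(k, m):
    """Raw-to-usable storage ratio for k data + m parity fragments."""
    return (k + m) / k


def recoverable(surviving_fragments, k):
    """Data survives as long as at least k of the k+m fragments remain,
    regardless of which fragments were lost."""
    return surviving_fragments >= k


# Hypothetical 8+2 policy vs. 2-way replication (effectively 1 data + 1 copy):
ec_ratio = storage_overhead(8, 2)       # 1.25x, as in the text
replica_ratio = storage_overhead(1, 1)  # 2.0x for 2-way replication

# With 8+2, losing any 2 fragments (disks, nodes or datacenters) is safe:
ok_after_two_losses = recoverable(10 - 2, 8)
```

This is also the arithmetic behind the Sudoku analogy: any 8 of the 10 fragments are enough to solve the puzzle, whereas replication loses data as soon as both copies of the same chunk are gone.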
Multi datacenter

ScaleIO is designed to be used on local networks, as it requires low-latency links on both the read and the write path for performance reasons. Since writes need to be acknowledged by 2 storage nodes, the write latency would be too high if the second copy were written to a second datacenter. Open vStorage is designed to store data safely and efficiently across multiple datacenters. To ensure local performance it uses an all-flash tier, while for disaster recovery reasons data can be stored on a capacity tier across multiple datacenters. Open vStorage uses APE (Asynchronous Policy Enforcement) to ensure data gets spread according to the best available policy. In case data can't be written optimally, it will first be stored sub-optimally and later, when for example the network link between the datacenters is restored, the data will be re-written with the optimal policy.

Flash Friendly

As ScaleIO is location based, it requires high-endurance flash technology because random writes generate small updates across the device. Many updates of the same location within a volume also cause the flash chips to wear out faster than normal, as you are constantly updating the same cells of the flash memory. Open vStorage is much more flash friendly as the write buffer aggregates random writes into large fragments which can be written sequentially. Sequential write endurance for SSDs and PCIe flash cards is typically higher than random write endurance due to write amplification, so the flash drives are less likely to fail with Open vStorage compared to ScaleIO. As every update to a volume gets appended to a new SCO, overwriting the same LBA of a volume many times will not hammer the same cells of the flash memory over and over. Since Open vStorage is less intensive on flash compared to ScaleIO, cheaper, lower-endurance SSDs can be used for the performance tier.
Complete History and Integrated Backup

As we move to an era of petabyte-scale data sets that change often, the methods around backup and replication need to change dramatically to deal with the fast ingest of data. Secondly, there is a trend towards copy data virtualization whereby backup sets are used for test/dev and analytic workloads. ScaleIO is purely a primary storage system that needs separate products and tools for backup, replication and copy data virtualization. This not only adds cost and complexity but also affects storage performance, as a significant amount of IOPS is wasted on making copies. Open vStorage combines unlimited snapshots, unlimited zero-copy clones, flash acceleration and its multi-datacenter spread using forward error correction to integrate backup, replication and copy data virtualization right into its architecture.
The bottom line

Open vStorage and ScaleIO are software-based distributed block storage solutions and both scale very well. Although both can be installed on the same hardware, the different design approaches result in a remarkable performance difference: Open vStorage delivers roughly 2 to 2.5 times the performance of ScaleIO. Next to the difference in performance, Open vStorage is more space efficient and supports unlimited snapshots. In a large ScaleIO environment data is also at risk, as a 2-disk failure might already cause data loss. Open vStorage stores data across datacenters and can survive even a complete datacenter going offline. Lastly, ScaleIO purely addresses the primary storage problem while Open vStorage addresses the entire storage lifecycle.