Fault-Tolerant Storage and Implications for the Cloud
Charles Snyder

Abstract

Fault-tolerance is an essential aspect of any storage system: data must be correctly preserved and transmitted in order to be useful. The emergence of cloud computing as a service requires storage systems to be fault-tolerant in the same sense as traditional computing systems, as well as in new ways that accommodate a distributed system. This paper looks at the current state of reliability in storage systems and how things change in the cloud.

1. Introduction

Reliability is a key issue in system design: an unreliable system will be subject to frequent down-time and decreased availability, which in the practical world translates to a loss of customers. Designing reliable systems requires the ability to deal with faults by masking them and recovering behind the scenes, so that users are unaware that a fault has even occurred. The most common technique for achieving such fault-tolerance is replication: creating copies of data, components, or even the entire system in the hope that at least one correct copy is available at any given time.

Creating fault-tolerant storage for a traditional computing system is no easy task; component degradation, physical stress, power failures, and similar events can cause data corruption or data loss, or even bring down the entire storage system [1]. Luckily, magnetic disk technology has been around long enough for significant advances in fault-tolerance to accumulate, though newer storage technologies bring their own sets of problems and solutions with respect to reliability [2].

Fault-tolerant storage is an even greater challenge in the cloud. Because the cloud paradigm emphasizes up-time and reliable computing, storage systems are held to higher standards of fault-tolerance. This typically means sacrificing resources for either more replication or smarter replication, to keep throughput high even during high-traffic periods or geographically isolated system failures.

2. Basic Concepts

Fault: Faults are unintended operations of the system. The term applies to incorrectly performed read/write operations as well as to storage failures: a faulty read or write places an incorrect value into memory or storage respectively, causing errors when that value is later used, while a storage failure makes storage temporarily or permanently unavailable. Faulty reads and writes can be caused by component deterioration, accumulation of dust on the magnetic platter, and a variety of other noise events [1]. Storage failures generally have more serious causes: overheating or a power failure makes a disk or rack inoperable, or a natural disaster wipes out a physical data center.

Reliability/Fault-tolerance: In storage systems, fault-tolerance is the ability to consistently access correct data. The longer a disk remains operable, the more it can be accessed; however, it must also provide non-volatile storage: when it is accessed, it must supply data that is consistent with past operations. Fault-tolerance is typically measured by the mean time to failure (MTTF), mean time to detection (MTTD), mean time to repair (MTTR) [3], and mean time to data loss (MTTDL) metrics.

Replication and Redundancy: Replication and redundancy are the main methods by which systems improve fault-tolerance. Components are replicated to account for the failure of one (or a few) of them; if some components fail, the others remain operational and assume responsibility [4]. Such components are placed redundantly so that a fault does not cause an extreme change in control flow [3]; this helps to mask both the fact that a fault has occurred and the latency involved in switching.

Cloud Storage: Storage in a cloud architecture amounts to a large number of traditional storage systems with an additional layer of abstraction that provides the illusion of a single, high-capacity storage system. To ensure reliable storage, not only must each disk be fault-tolerant, but the overarching system must be as well: cloud providers often have many server farms across the world, and the system must remain reliable in the face of a single disk failure as well as the failure of an entire farm. Typically a cloud storage system leaves responsibility for data reliability to the individual disks and to network-level error correction, while itself assuming responsibility for masking disk and farm failures by replicating data across a handful of farms. In this paper, we examine fault-tolerance at both the disk and cloud levels.

3. Traditional (Single Machine) Storage Fault-Tolerance

The main ways in which traditional machines have achieved reasonably fault-tolerant storage are by making the storage disks themselves more fault-tolerant and by designing aggregated storage systems that introduce redundancy. A physical disk is subject to wear from use and age: constant rotation and exposure to magnetic fields can degrade parts of the disk and affect its ability to store data.
During operation the machine can be jarred, causing the read head to strike the disk in a "head crash"; this damages the area of impact and scatters magnetic dust within the disk's sealed container, which can later interfere with reading. Such degradation, together with manufacturing defects, renders parts of the disk temporarily or permanently unusable, so the main sense of fault-tolerance in a single disk is to mask as many of the temporary failures as possible (permanent failures cannot be overcome without some outside assistance).

To achieve this fault-tolerance, levels of disk controllers have been introduced to ensure fail-fast, reliable behavior. In a fail-fast system, a component that fails immediately stops operation and alerts its superior; rather than attempting to complete a failed operation, it notifies the superior, which is responsible for making a more "intelligent" decision. One way to make a storage system fail-fast is to introduce an error-detecting mechanism such as a sector checksum: if a read fails at the mechanical level some data will still be returned, but by verifying that data against the checksum the controller detects the failure and can report a fail-fast error [4].

[1] discusses such an approach. At the lowest level the hardware simply reads and writes as asked. At the fail-fast level, a read entails checking the sector checksum and verifying that the data read was correct (failing if it was not). At the careful level, a read entails performing fail-fast reads until one succeeds; if none succeeds within a certain number of attempts, the careful layer concludes that a hardware error has occurred and, since it cannot obtain correct data, reports an error. Seen from a higher level, such a system tolerates temporary errors such as magnetic dust on the disk by retrying in hopes of a success: if the dust shifts as the disk rotates, the original data will be uncovered. Permanent errors, such as sectors left unusable by the manufacturing process, are insurmountable with this technique.

To provide additional reliability, we can aggregate several disks into a functionally single storage system, as in RAID. At the core of RAID design is the duplication and distribution of data across multiple disks. RAID 1 simply duplicates the data onto two identical disks: if disk 1 fails, we simply try disk 2. This system will work through a single disk failure, and it will preserve data after a single permanent sector failure, but if both disks fail at the same sector the data will be lost. Other RAID organizations, such as RAID 4 and 5, provide the same single-disk failure tolerance with increased performance.

The advantage of most RAID implementations is that they not only detect and mask errors but also provide for error correction. If an entire RAID 1 disk fails and all of its data is lost, the contents of the functioning disk can be copied to a new disk. Similarly, in RAID implementations with parity, the contents of the missing disk can be inferred from the remaining disks. Unfortunately, many obvious disk recovery schemes require the total use of the remaining hardware: if a RAID 4 parity disk fails, the obvious solution is to reconstruct the missing sectors by reading the corresponding data sectors and re-computing the parity. While these reads can be done in parallel with one another, user reads cannot be served in parallel with the parity reconstruction, and so system performance will suffer. [7] proposes a more efficient technique for recovering data from an XOR-based checksum, using fewer disk I/O operations so as to improve performance.

With the emergence of new storage technology, it may become possible to make storage more innately reliable thanks to the medium.
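
The layered reads and the parity-based recovery described in this section can be made concrete with a minimal sketch. Everything named here is an assumption for illustration: the function names, the `ReadError` type, the use of CRC32 as the sector checksum, the retry limit, and the simulated disk are hypothetical, not taken from [1] or from any RAID implementation. The parity code shows only the plain single-parity XOR scheme, not the I/O-optimized recovery of [7].

```python
import zlib
from functools import reduce


class ReadError(Exception):
    """Hypothetical error type reported by the fail-fast and careful layers."""


def fail_fast_read(raw_read, sector):
    """Fail-fast layer: read once, verify the checksum, fail if it is wrong."""
    data, stored_crc = raw_read(sector)
    if zlib.crc32(data) != stored_crc:  # CRC32 is an assumed checksum choice
        raise ReadError(f"checksum mismatch in sector {sector}")
    return data


def careful_read(raw_read, sector, max_tries=3):
    """Careful layer: retry fail-fast reads, give up after max_tries."""
    for _ in range(max_tries):
        try:
            return fail_fast_read(raw_read, sector)
        except ReadError:
            continue  # a temporary fault may clear on the next attempt
    raise ReadError(f"sector {sector} unreadable after {max_tries} tries")


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (the parity of those blocks)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


def make_flaky_disk(sectors, bad_reads):
    """Simulated lowest layer: the first bad_reads reads return garbage."""
    state = {"bad": bad_reads}

    def raw_read(sector):
        data = sectors[sector]
        crc = zlib.crc32(data)  # checksum stored when the sector was written
        if state["bad"] > 0:
            state["bad"] -= 1
            return b"\x00" * len(data), crc  # corrupted payload, valid checksum
        return data, crc

    return raw_read


if __name__ == "__main__":
    disk = make_flaky_disk({0: b"hello"}, bad_reads=2)
    print(careful_read(disk, 0))  # succeeds on the third fail-fast read

    data = [b"abcd", b"efgh", b"ijkl"]
    parity = xor_blocks(data)
    print(rebuild_missing([data[0], data[2]], parity))  # recovers data[1]
```

In this sketch a temporary fault is masked as long as it clears within `max_tries` attempts, while a permanent one surfaces as a `ReadError` that a higher layer, such as a RAID controller, would have to handle. The rebuild reads every surviving block, which is the full-hardware cost the text attributes to the obvious recovery schemes.
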
Advances in solid-state drives could lead to memory that is competitive in capacity and price (with advantages in speed), while avoiding the current manufacturing problem of bad sectors and the operational degradation caused by head crashes and magnetic dust. Other technologies such as phase-change memory could provide fast, dense memory without the volatility of DRAM [2].

4. Cloud Storage Fault-Tolerance

Cloud datacenters offer large volumes of storage to many users; these datacenters are composed of enormous numbers of individual storage drives located in farms across continents, yet they must still appear and function as a single storage unit. For such systems it is assumed that each individual storage device functions as reliably as is necessary, and the topic of fault-tolerant cloud storage focuses on maintaining data accessibility and consistency in the face of shifting network loads and geographical disturbances.

A major initial concern of cloud storage is achieving an acceptable balance between the large amounts of data and the customers' preference for fast access. If the data is placed on as few machines as possible, the cloud host will save on overhead and operating costs, but congestion at these machines will affect performance in the customers' eyes. Conversely, distributing data across machines so that the amount of data on each machine is well below capacity will provide lower access times, but will incur higher operating costs (costs that will directly affect customers). In addition, saving the data on only one machine would introduce an obvious single point of failure, which, given the large network traffic seen by cloud systems, is unacceptable. Instead data is saved on multiple machines: if one machine already has a large queue of requests, a new request can be routed to a second, available machine to reduce response time; similarly, if the first machine overheats and becomes temporarily or permanently unavailable, the data is not lost and requests can be routed to the second machine.

However, the datacenter that contains these machines becomes the new point of failure: a spike in network traffic, a power outage, or a natural disaster can make the entire datacenter unusable. Because of this it is common for cloud providers to distribute data across geographically diverse datacenters in order to ensure constant availability (an earthquake at a datacenter in California will leave a datacenter in China untouched and working).

This replication introduces the problem of consistency: if a user's data is written to a number of corresponding machines, then each of these machines must contain an identical copy of the data. Luckily, protocols for establishing such consistency in a system of multiple machines already exist, such as the Paxos algorithm or the solution proposed in [6], and it is relatively straightforward to apply these protocols in a cloud environment. To maintain this replication consistency, the system must be able to update itself if a machine comes back online after being absent from the original write, or if one of the original machines fails and must be replaced. Restoring this consistency is much the same as in other multiple-storage systems like RAID; however, the constant network traffic and concurrent machine accesses in a cloud environment place more emphasis on efficiency to prevent interference with other activities, so finding an efficient recovery scheme such as the one proposed in [7] is advantageous.

A key advantage in attacking the problem of reliability in cloud storage is the cloud's similarity to a single-machine RAID implementation. If we view the entire cloud structure (processing machines, network, controllers, etc.) as the non-storage elements of a single machine, then the machines aggregated from all of the datacenters become an extensive array of semi-redundant, semi-independent disks. The main difference between the cloud array and the traditional RAID array is that the cloud uses an uneven distribution: while the RAID array places data evenly according to a simple formula, data placement in the cloud appears more chaotic and can even adapt to network load and usage spikes. Regardless of this difference, viewing cloud storage at a high level as RAID allows the general application of classic tools and algorithms, and it simplifies analysis. Indeed, even the physical dissimilarities of RAID and cloud storage systems mimic each other: RAID ideally uses identical disks, but this is impossible because of the heterogeneity imparted by disk manufacturing techniques, so RAID controllers must adapt to small differences [5], while a cloud storage system intentionally aggregates heterogeneous disks.

5. Conclusions

The main mechanisms used to increase the fault-tolerance of traditional storage systems are abstraction and replication: by adding levels of abstraction above the basic disk controller we achieve fail-fast, reliable performance, while the replication of data and disks provides redundancy that can tolerate a fixed number of permanent failures. Reliable cloud storage expands on these ideas, using controllers that distribute data to provide independence and quick access; however, the vast scale of cloud storage requires a more calculated approach to ensure a realistic system. Thanks to the abstract similarity of traditional and cloud storage systems, existing methods to ensure fault-tolerance are often easily applied to the cloud in theory, but may require modifications in practice to accommodate heterogeneous and geographically distributed systems.

6. References

[1] J. Saltzer and M. Kaashoek. Principles of computer system design: an introduction, Part II.
[2] K. Bailey, L. Ceze, S. Gribble, and H. Levy. Operating system implications of fast, cheap, non-volatile memory. In USENIX HotOS XIII.
[3] D. Siewiorek. Architecture of fault-tolerant computers. In Computer, Vol. 17, Issue 8.
[4] J. Gray and D. Siewiorek. High-availability computer systems. In Computer, Vol. 24, Issue 9.
[5] E. Krevat, J. Tucek, and G. Ganger. Disks are like snowflakes: no two are alike. In USENIX HotOS XIII.
[6] R. Padilha and F. Pedone. Scalable Byzantine fault-tolerant storage. In IEEE/IFIP 41st International Conference on Dependable Systems and Networks.
[7] O. Khan, R. Burns, J. Plank, and C. Huang. In search of I/O-optimal recovery from disk failures. In USENIX HotStorage '11, 2011.
