TRIP: Temporal Redundancy Integrated Performance Booster for Parity-Based RAID Storage Systems

2010 16th International Conference on Parallel and Distributed Systems

TRIP: Temporal Redundancy Integrated Performance Booster for Parity-Based RAID Storage Systems

Chao Jin, Dan Feng, Hong Jiang, Lei Tian, Jingning Liu, Xiongzi Ge
Wuhan National Lab for Optoelectronics; School of Computer Science & Technology, Huazhong University of Science & Technology, Wuhan, China
Department of Computer Science & Engineering, University of Nebraska-Lincoln
chjinhust@gmail.com, dfeng@hust.edu.cn, j.n.liu@63.com, xiongzige@gmail.com, jiang@cse.unl.edu, tian@cse.unl.edu

Abstract: Parity redundancy is widely employed in RAID-structured storage systems to protect against disk failures. However, the small-write problem has been a persistent root cause of the performance bottleneck of such parity-based RAID systems, due to the additional parity update overhead incurred by each write operation. In this paper, we propose a novel RAID architecture, TRIP, built on conventional parity-based RAID systems. TRIP alleviates the small-write problem by integrating and exploiting the temporal redundancy (i.e., snapshots and logs) that commonly exists in storage systems to protect data from soft errors, while boosting write performance. During write-intensive periods, TRIP reduces the penalty of each small-write request to as little as one device IO operation, at a minimal cost of maintaining the temporal redundant information. Reliability analysis, in terms of Mean Time to Data Loss (MTTDL), shows that the reliability of TRIP is only marginally affected. On the other hand, our prototype implementation and performance evaluation demonstrate that TRIP significantly outperforms conventional parity-based RAID systems in data transfer rate and user response time, especially in write-intensive environments.

Keywords: small write; parity redundancy; temporal redundancy; snapshot; log; performance booster

I. INTRODUCTION

The Redundant Array of Independent Disks (RAID) [1] architecture has been popular in storage systems for decades, due largely to its major advantages of high performance and fault tolerance. Among the commonly used RAID levels, RAID0 provides the highest performance. However, it is vulnerable to disk failures, since it does not employ any data protection technique. By maintaining parity check information within the disk array, parity-based RAID systems (e.g., RAID5 or RAID6) can rebuild lost data after certain disk failures.

Parity-based RAID systems provide read performance comparable to RAID0. However, they suffer from severe write performance degradation, since they must additionally update the parity blocks upon each write operation. The situation worsens in write-intensive environments with small request sizes. Taking RAID5 as an example, when a small write request arrives for a data block, the old content of that data block and the corresponding parity block are first read from the disks into the buffer, then the new content of the parity block is generated by an XOR calculation on the fly, and finally the newly generated parity block and the new content of the data block are written to the disks. Thus, a write to a data block in a RAID5 system incurs as many as four expensive disk IO operations. This problem, called the small-write problem, severely degrades the performance of parity-based RAID systems.

In this paper, we propose a novel RAID architecture, called TRIP, built on traditional parity-based RAID systems.
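As a concrete illustration of this four-IO penalty, the following minimal Python sketch walks through the classic RAID5 read-modify-write path. The in-memory Disk class and the function names are illustrative stand-ins, not part of the TRIP prototype.

```python
class Disk:
    """Tiny in-memory stand-in for a block device (illustrative only)."""
    def __init__(self, nblocks, block_size=4096):
        self.blocks = [bytes(block_size) for _ in range(nblocks)]
    def read(self, lba):
        return self.blocks[lba]
    def write(self, lba, data):
        self.blocks[lba] = bytes(data)

def raid5_small_write(disks, data_disk, parity_disk, lba, new_data):
    """Classic RAID5 read-modify-write of one data block: 2 reads + 2 writes."""
    old_data = disks[data_disk].read(lba)        # IO 1: read old data block
    old_parity = disks[parity_disk].read(lba)    # IO 2: read old parity block
    # New parity follows from the XOR identity: P_new = P_old XOR D_old XOR D_new
    new_parity = bytes(p ^ d ^ n for p, d, n in zip(old_parity, old_data, new_data))
    disks[data_disk].write(lba, new_data)        # IO 3: write new data block
    disks[parity_disk].write(lba, new_parity)    # IO 4: write new parity block
    return 4                                     # small-write penalty in disk IOs
```

A RAID0 write of the same block would issue only the single data write, which is exactly the gap TRIP aims to close during write-intensive periods.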
TRIP integrates and exploits temporal redundancy, in the form of snapshots and logs, to overcome the small-write problem of parity-based RAID systems. Temporal redundancy [2] widely exists in storage systems to protect data from soft errors. In contrast to parity information, which we call spatial redundancy, temporal redundancy is a type of data redundancy that can restore files to a previous state along the timeline once data corruption or intrusion occurs.

The main idea behind TRIP is that, during write-intensive periods, the TRIP system suspends parity updates and operates like RAID0, while the disk array is temporarily protected by the temporal redundancy information stored in snapshots or logs. When the system becomes idle or lightly loaded, the delayed parity update operations are performed and the system returns to the normal state, in which the parity is consistent with the data. In most storage systems, snapshot and logging modules are already integrated as fundamental functionalities, so an implementation of TRIP can simply reuse these existing modules and thus requires minimal modifications to existing systems.

The main contributions of this paper include: (1) We reveal the fact that there usually exist overlaps between different forms of redundancy in storage systems; namely, temporal redundancy, if configured properly, can help recover from the disk failures that would normally be handled by spatial redundancy. (2) We propose the TRIP architecture to boost the write performance of parity-based RAID systems. TRIP employs the temporal redundancy generated by the snapshot and logging modules to protect the disk array from disk failures, and suspends parity update operations to accelerate writes during write-intensive periods. (3) We implement a TRIP prototype in the Linux software RAID framework, and evaluate its performance through extensive benchmark-driven and trace-driven experiments. Experimental results show that TRIP significantly increases the data transfer rate and decreases the request response time compared with conventional parity-based RAID systems.

The rest of the paper is structured as follows. The next section gives a detailed description of TRIP. We analyze the reliability of TRIP in Section III and evaluate the performance of TRIP through extensive experiments in Section IV. We review related work in Section V and conclude the paper in Section VI.

II. DESIGN AND IMPLEMENTATION OF TRIP

A. TRIP Architecture

The architecture of TRIP is shown in Figure 1, where, in addition to the standard RAID architecture, TRIP incorporates snapshot and logging modules into the system. The snapshot and logging modules are used to capture and exploit the temporal redundancy. A dedicated snapshot disk, outside the RAID disks, is configured to store the snapshot information. Similarly, a dedicated log disk is used to store the logging information.

Figure 1. TRIP Architecture.

Generally, TRIP can employ any of the existing, mature snapshot techniques; we select the Copy-On-Write (COW) technique out of a combined concern for performance and space overhead. Moreover, COW snapshots have been widely integrated into existing storage systems; for instance, Linux LVM provides a standard block-level COW snapshot function. TRIP can likewise use any of the existing logging techniques, such as the sequential logging used by LFS [3] or track-based disk logging [4]. We choose sequential logging since it is easy to implement and the space utilization of the log disk is higher than that of track-based logging. Note that TRIP is orthogonal to the specific snapshot or logging techniques, and thus it will benefit from more efficient snapshot or logging techniques in the future.

B. Process Flow

The basic idea behind TRIP is to postpone parity updates from IO-intensive periods to lightly loaded periods in order to accelerate writes. For reliability, TRIP exploits the temporal redundancy provided by the snapshot and log modules to protect against disk failures during the parity-inconsistent periods. TRIP operates in one of three states, denoted States A, B, and C. In State A, the system is idle or under light load, and runs like a standard RAID. In State B, the system is under heavy load, and parity update operations are suspended. State C is the transitional state from State B back to State A, in which the suspended parity update operations are performed and the system returns to the parity-consistent state. The snapshot and logging functions are activated in State B, and deactivated in States A and C. An idleness detector [5] is used to identify the system's busy and idle periods and to determine when to transition from one state to another. We assume NVRAM is used in the storage controller to protect the metadata structures against power or controller failures. The process flow of the TRIP system in the normal mode (i.e., in the absence of disk failures) is shown in Figure 2.

Figure 2. Normal Process Flow of TRIP.

In State C, the suspended parity update operations can be performed in either of two alternative approaches.

Reconstruction-write. A parity block is re-computed by XORing all the data blocks of the entire parity stripe; the temporal redundant data is not used.

Read-modify-write. For each data block that has been updated in State B (i.e., its original content has been copied to the snapshot), the parity block in the same parity stripe is updated through the following three steps. First, the original content of the data block and of its parity block are read from the snapshot. Second, the current content of the data block is read from the RAID. Third, the parity block is re-computed by XORing the three.

It must be noted that, in either approach, only the parity blocks in the modified stripes need to be re-computed.
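To make the two approaches concrete, here is a minimal Python sketch of how a stripe's parity could be brought back to consistency in State C. The helper names and data layout are illustrative assumptions rather than the prototype's code.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

def reconstruction_write(stripe_data):
    """Recompute the parity from all current data blocks of the stripe."""
    return xor_blocks(*stripe_data)

def read_modify_write(snapshot_old_parity, snapshot_old_data, current_data):
    """Fold one delayed update into the parity:
       P_new = P_old (snapshot) XOR D_old (snapshot) XOR D_current (RAID)."""
    return xor_blocks(snapshot_old_parity, snapshot_old_data, current_data)
```

In the read-modify-write case, the old data and old parity come from the snapshot records made in State B, so the stripe can be repaired without re-reading every surviving block.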

Besides the delayed parity updates, there may also be foreground write requests in State C. Fortunately, the foreground write operations do not compete with the delayed parity update operations, but cooperate with them. If the reconstruction-write approach is used for the foreground write operations, performing a foreground write on a parity stripe returns that stripe to the consistent state, and any delayed parity update operation on it can be cancelled. On the other hand, with the read-modify-write approach, writing a data block that is associated with a delayed parity update operation (i.e., a block that has been updated in State B but not yet in State C) follows three steps: (1) the original content of this data block is read from the snapshot instead of the RAID; (2) the original content of the parity block in this parity stripe is read from the snapshot; (3) the parity block is re-computed by XORing the original content of the data block, the original content of the parity block, and the current content of the data block (contained in the write request). After that, the delayed parity update operation associated with this data block can be cancelled. This process is illustrated in Figure 3. Note that writing a data block that is not associated with any delayed parity update operation follows the standard parity update procedure.

Figure 3. In State B, data blocks D1 and D2 are updated to D1' and D2'. Their old contents are recorded in the snapshot volume, and their new contents are logged on the log disk. The two records in the snapshot volume indicate two delayed parity updates. In State C, D1' is updated again to D1'', and the parity block is subsequently updated as P' = P XOR D1 XOR D1''. After that, the record of D1 in the snapshot volume is removed, and the delayed parity update associated with D1 is cancelled. In other words, the foreground parity update and the delayed parity update associated with one data block can be performed simultaneously through a single operation.

C. Small-Write Overhead Analysis

The primary goal of TRIP is to build a high-performance parity-based RAID system that weathers heavy write-intensive workloads. TRIP is based on traditional parity-based RAID architectures (e.g., RAID5, RAID6) and aims to alleviate the small-write problem inflicted on them. However, in order to maintain the disk-failure tolerance of parity-based RAID architectures, TRIP trades snapshot and logging overhead for high reliability and availability. Thus, the key question is: is the incurred snapshot and logging overhead justified by the reduced small-write penalty? In the following, we give a detailed analysis.

As mentioned before, a small write to one data block incurs four disk IO operations (two reads and two writes) for RAID5, and no fewer than six operations for RAID6. The TRIP system delays parity updates and runs as RAID0 during busy periods, so writing a data block incurs only one disk operation. This, however, comes at the additional cost of the snapshot and logging overhead. For the Copy-On-Write snapshot activity, if a data block happens to be overwritten multiple times due to access locality, the system copies its original data to the snapshot volume only the first time it is updated. This copy incurs only two disk operations: a read from the primary data (the RAID volume) and a write to the snapshot volume. Furthermore, we can configure a small write buffer to accumulate the writes to the snapshot volume and flush the data to the snapshot disk periodically as large sequential writes, so the penalty of the write operations to the snapshot volume can be mitigated. For the sequential logging activity, its penalty to the system is also negligible, as it is in LFS. In summary, the TRIP system incurs just one disk write and one infrequent disk read for each small-write operation, with a minimal cost of periodically writing the snapshot and log disks sequentially.

D. Exception Handling

In addition to the array of RAID disks, TRIP integrates a snapshot disk and a log disk into the system. It is these two disks that store the temporal redundant data and give TRIP the opportunity to delay the parity updates for performance improvement under write-intensive workloads. Thus, the state of these two disks is critical to the entire system, and any possible exception on them must be handled properly and immediately.

The first possible exception occurs when the log disk becomes full while the system is in State B (see Figure 2). In this case, the system should turn off the snapshot and logging modules immediately and switch to State C. However, the system may still be under an intensive workload, since this is an urgent state switch not guided by the idleness detector, and the concurrent normal parity updates and delayed parity updates may degrade system performance. Thus, to guarantee the foreground application performance, the parity updates incurred by foreground IOs are performed before the delayed parity updates, and the delayed parity update operations are further postponed until the idleness detector signals the arrival of an idle period. Fortunately, as mentioned in Section II-B, the normal parity updates cooperate with the delayed parity updates due to access locality. As a result, after performing the normal parity updates, many delayed parity update operations can be cancelled.

The second possible exception happens when the snapshot disk (but not the log disk) is full while the system is in State B. In this case, we can certainly handle it in the same manner as the first exception. However, there is a different, optimized way to handle it. Note that the system only needs to write the snapshot disk when a data block is overwritten for the first time. Thus, if a data block is being overwritten for the first time, the system simply executes the standard parity update procedure. Otherwise, it performs the regular State B procedure, namely, delaying the parity update and logging the data on the log disk. In this way, the system need not switch to State C until the log disk is full or the system becomes idle.
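The State B write path analyzed above (one in-place write, plus a copy-on-write record on the first overwrite and a sequential log append) can be summarized by the following Python sketch. The data structures, the stripe-layout constant, and the class name are illustrative assumptions, not the prototype's actual implementation.

```python
DATA_BLOCKS_PER_STRIPE = 7   # assumed layout: 8-disk RAID5, 7 data blocks + 1 parity

class TripStateB:
    """Simplified State B write path: parity updates are delayed while a COW
    snapshot and a sequential log provide the temporal redundancy."""

    def __init__(self, raid_blocks):
        self.raid = dict(raid_blocks)   # lba -> current content on the RAID volume
        self.snapshot = {}              # lba -> original content (COW records)
        self.log = []                   # append-only (lba, new content) records
        self.pending = set()            # stripes awaiting a delayed parity update

    def write(self, lba, new_data):
        # Copy-on-write: the original content is saved only on the first
        # overwrite, so repeated writes to a hot block cost one disk IO each.
        if lba not in self.snapshot:
            self.snapshot[lba] = self.raid[lba]          # one read plus one (batched) write
        self.log.append((lba, new_data))                 # cheap sequential log append
        self.raid[lba] = new_data                        # the single in-place data write
        self.pending.add(lba // DATA_BLOCKS_PER_STRIPE)  # remember the delayed update
```

When the idleness detector later switches the system to State C, the entries in pending drive the reconstruction-write or read-modify-write repairs sketched earlier, after which the snapshot and log records can be discarded.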

Generally, if the log disk is full, TRIP performs as well as a regular RAID system; if the snapshot disk is full, TRIP may still perform much better than a regular RAID system. We can also increase the capacities of the snapshot and log disks to make such exceptions less likely.

The third possible exception occurs when the snapshot disk, the log disk, or both fail while the system is in State B. Since the temporal redundant data is damaged, the primary RAID volume is left unprotected, and an arbitrary disk failure would lead to permanent data loss. Thus, in this case we have no choice but to perform an urgent state switch to State C. Furthermore, the delayed parity updates must be executed with higher priority using the reconstruction-write approach, while IO requests issued by foreground applications are suspended, to ensure that the system returns to the parity-consistent state as soon as possible. Generally, the third exception is more urgent than the first two, in that it may degrade the performance of the RAID system significantly and, more importantly, impose a window of vulnerability on the RAID. To make this exception less likely, we may use disks with lower failure rates (e.g., SSDs or PCM) as the snapshot and log disks. We may also consider mirroring (i.e., RAID1) to improve the reliability of the snapshot and logging modules, though it would incur a slightly higher write overhead.

E. Failure Recovery

When TRIP is in the parity-consistent state (i.e., State A in Figure 2), it can tolerate disk failures exactly as a regular RAID system does. When the data and parity are inconsistent, TRIP can still recover from disk failures with the help of the temporal redundant data. We denote the TRIP systems based on RAID5 and RAID6 as TRIP5 and TRIP6, respectively. First we show how TRIP5 can rebuild from one disk failure using the temporal redundant data; the case for TRIP6 is very similar.

For RAID5, the entire capacity is divided into multiple parity stripes across the component disks, with each disk contributing one equal-sized block to a parity stripe. Thus, rebuilding the failed disk is tantamount to rebuilding all the blocks on that disk. We now show that TRIP5 can recover in either of the following two failure cases.

Case 1: a disk failure occurring in State B. Consider an arbitrary block b on the failed disk. If b has been updated in State B, it can be directly rebuilt by copying its latest version from the log disk. Otherwise, if b has not been updated, it can be rebuilt from the snapshot: in the snapshot, the parity stripes are consistent, so block b can be rebuilt through the standard RAID rebuild procedure, namely, XORing all the remaining blocks in the same parity stripe.

Case 2: a disk failure occurring in State C. The snapshot and logging modules stop working in State C. As shown in Figure 4, the state of the snapshot is modified by each of the delayed update operations, so the modified snapshot is no longer the point-in-time image of the primary data.

Figure 4. Each record in the snapshot volume indicates a delayed parity update. When a delayed parity update is performed, the corresponding record in the snapshot volume is removed, i.e., the snapshot is modified.

The modified snapshot is defined as follows: in a parity stripe, if a block does not have a record in the snapshot volume, its current content on the RAID is regarded as part of the snapshot; otherwise, the record in the snapshot volume is regarded as part of the snapshot. It must be noted that the modified snapshot is always in a parity-consistent state. If the read-modify-write approach is used for the parity update operations, a given parity block may be un-updated, partly updated, or fully updated; in any of these cases, the modified snapshot is still parity-consistent. If the reconstruction-write approach is used, the parity block is always fully updated after an update operation.

Consider an arbitrary block b on the failed disk again. If b has been updated in State B (but not in State C), its current content can be found on the log disk, as in Case 1. Otherwise, if b has been updated in State C, or has not been updated in either state, it can be rebuilt from the modified snapshot, since the modified snapshot is parity-consistent. As we can see, with the help of the temporal redundancy, TRIP5 can always tolerate one disk failure even when it is in a parity-inconsistent state.

Similarly, TRIP6 maintains the disk-failure tolerance of a regular RAID6 system, namely, it can always rebuild from two concurrent disk failures in the disk array. Take the one-dimensional Reed-Solomon RAID6 system [6] as an example. For each Reed-Solomon parity stripe with n blocks, (n-2) data blocks are encoded into 2 parity blocks, and any two blocks in the parity stripe can be reconstructed from the other (n-2) blocks. Suppose two disk failures occur in a Reed-Solomon based TRIP6 system, and denote the two lost blocks in a Reed-Solomon parity stripe as b1 and b2. If both b1 and b2 have been updated in State B (but not in State C), they can be rebuilt directly by copying their latest versions from the log disk. If neither b1 nor b2 has been updated in State B, they can be rebuilt from the snapshot (in State B) or the modified snapshot (in State C). Finally, if only one of them (say, b1) has been updated in State B, then b1 can be rebuilt from the log disk, and b2 can then be rebuilt from the snapshot or modified snapshot.
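The recovery argument above can be summarized in a short Python sketch for the TRIP5 case. The dictionary-based representation of the log, snapshot, and surviving blocks is an illustrative assumption, not the on-disk format.

```python
def rebuild_block(lost_lba, stripe_lbas, surviving, log_records, snapshot):
    """Rebuild one lost block of a parity stripe while TRIP is parity-inconsistent.

    surviving:   lba -> current content of every readable block in the stripe
    log_records: lba -> latest content logged for blocks written in State B
    snapshot:    lba -> original content still recorded by the COW snapshot
                 (a record disappears once its delayed parity update is done)
    """
    # Block updated in State B with its delayed parity update still pending:
    # the latest version is simply copied back from the log disk.
    if lost_lba in log_records and lost_lba in snapshot:
        return log_records[lost_lba]

    # Otherwise rebuild from the (modified) snapshot view of the stripe, which
    # is parity-consistent: use the snapshot record where one exists, and the
    # current on-disk content otherwise, XORed over all other blocks.
    rebuilt = None
    for lba in stripe_lbas:
        if lba == lost_lba:
            continue
        block = snapshot.get(lba, surviving[lba])
        rebuilt = block if rebuilt is None else bytes(a ^ b for a, b in zip(rebuilt, block))
    return rebuilt
```

The same expression covers both Case 1 and Case 2, because the "modified snapshot" of a block degenerates to its current on-disk content whenever no snapshot record remains.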

As for two-dimensional parity-array RAID6 systems, such as EVENODD [7] and P-Code [8], their parity stripes can be regarded as combinations of one-dimensional RAID5 stripes along different slopes. When double disk failures occur, they can rebuild the lost data through a zigzag-style reconstruction process, with each step using a RAID5-like parity equation. Thus, a parity-array based TRIP6 system can also rebuild from two disk failures while in a parity-inconsistent state, and its reconstruction process is similar to that of the TRIP5 system. Moreover, we can also build TRIP systems on top of erasure-coded storage systems that tolerate more than two disk failures, such as STAR [9] and HoVer [10]. All such TRIP systems maintain the same disk-failure tolerance as the original RAID systems. Generally, the more disk failures a RAID system can tolerate, the more operational overhead a small write incurs, and thus the greater the performance advantage of the corresponding TRIP system over the original RAID system.

III. RELIABILITY ANALYSIS

In this section, we adopt the Mean Time to Data Loss (MTTDL) metric [11, 12] to estimate the reliability of TRIP. TRIP is as reliable as a regular RAID system when it is in the parity-consistent state, so in the following we only estimate the reliability of TRIP when it is in the parity-inconsistent state. Because of the fluctuating nature of real workloads, TRIP spends a significant portion of time in the parity-consistent state; thus, the overall MTTDLs of the TRIP systems should be much higher than the ones estimated below.

We assume that disk failures are independent events following an exponential distribution of rate λ, and that repairs follow an exponential distribution of rate μ. For simplicity, we do not consider latent sector errors in the system model, and leave this as future work. According to the standard result on the reliability of RAID5 [11, 12], the MTTDL of an 8-disk RAID5 is approximately:

    MTTDL_RAID5(8) ≈ μ / (56 λ²)                                          (1)

Figure 5 shows the state transition diagram for a TRIP5 system consisting of an 8-disk RAID5 set, a snapshot disk, and a log disk. We assume that snapshot disk failures and log disk failures follow exponential distributions of rates λS and λL, respectively. As mentioned in Section II-D, when the snapshot disk, the log disk, or both fail, the TRIP system must perform the delayed parity update operations immediately and return to the parity-consistent state; we assume that this completion time follows an exponential distribution of rate δ.

State <0> represents the normal state of the system, in which all the disks are operational. A failure of any of the 8 disks in the RAID set brings the system to State <1>, and a subsequent failure of any of the remaining 7 disks in the RAID set results in data loss.

Figure 5. State transition diagram for TRIP5.

The repair transition brings the system from State <1> back to State <0>. If the snapshot disk or the log disk fails in State <1>, data loss results. On the other hand, if the snapshot disk or the log disk fails in State <0>, the system moves to State <2>. In State <2>, if all the delayed parity update operations are performed successfully, the system goes back to State <0>; otherwise, if any of the 8 disks in the RAID set fails in State <2>, data loss results.

The Kolmogorov system of differential equations describing the behavior of the TRIP5 system is:

    dp0(t)/dt = -(8λ + λS + λL) p0(t) + μ p1(t) + δ p2(t)
    dp1(t)/dt = -(μ + 7λ + λS + λL) p1(t) + 8λ p0(t)                      (2)
    dp2(t)/dt = -(8λ + δ) p2(t) + (λS + λL) p0(t)

where pi(t) is the probability that the system is in State <i>, with the initial conditions p0(0) = 1 and pi(0) = 0 for i ≠ 0. The Laplace transform of Equation (2) is:

    s p0*(s) - 1 = -(8λ + λS + λL) p0*(s) + μ p1*(s) + δ p2*(s)
    s p1*(s) = -(μ + 7λ + λS + λL) p1*(s) + 8λ p0*(s)                     (3)
    s p2*(s) = -(8λ + δ) p2*(s) + (λS + λL) p0*(s)

Observing that the Mean Time to Data Loss of the system is given by [11]:

    MTTDL = Σi pi*(0) = p0*(0) + p1*(0) + p2*(0)                          (4)

we can work out, from Equations (3) and (4), the MTTDL of the TRIP5 system composed of an 8-disk RAID5 set, a snapshot disk, and a log disk. Since the resulting expression is rather complex (a ratio of two large polynomials), it is not displayed here; the computed values are presented in Figure 6 instead.
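Setting s = 0 in Equation (3) turns the Laplace-domain system into a small linear system whose solution, summed per Equation (4), gives the MTTDL. The NumPy sketch below illustrates this numerically; the rate values in the example call are placeholders chosen only to mirror the orders of magnitude discussed in this section, not the exact parameters behind Figure 6.

```python
import numpy as np

def trip5_mttdl(lam, mu, lam_s, lam_l, delta):
    """Solve Equation (3) at s = 0 and return MTTDL = p0* + p1* + p2* (Equation (4))."""
    # Rearranged as A @ [p0*, p1*, p2*] = [1, 0, 0]:
    A = np.array([
        [8 * lam + lam_s + lam_l, -mu,                           -delta          ],
        [-8 * lam,                 mu + 7 * lam + lam_s + lam_l,  0.0            ],
        [-(lam_s + lam_l),         0.0,                           8 * lam + delta],
    ])
    b = np.array([1.0, 0.0, 0.0])
    return np.linalg.solve(A, b).sum()

# Illustrative placeholder rates (per hour): 1/lambda = 50,000 h, 1/mu = 24 h,
# SSD snapshot/log disks with MTTF = 200,000 h, 1/delta = 24 h.
mttdl_hours = trip5_mttdl(lam=1/50_000, mu=1/24,
                          lam_s=1/200_000, lam_l=1/200_000, delta=1/24)
print(mttdl_hours / 8760, "years")
```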

Figure 6. MTTDL attained by different disk arrays.

Figure 6 plots MTTDL as a function of MTTR (Mean Time To Repair) for an 8-disk RAID5, a TRIP5-H system, and a TRIP5-S system. Each TRIP system is composed of an 8-disk RAID5 volume, a snapshot disk, and a log disk. TRIP5-H uses traditional hard disks (the same as the RAID disks) as the snapshot and log disks, while TRIP5-S uses solid state disks (SSDs). The RAID disk failure rate is assumed to be one failure every fifty thousand hours, in line with recent field studies [13]. The snapshot disk failure rate λS and the log disk failure rate λL are determined by the disks used: for TRIP5-H we assume λS = λL = λ, and for TRIP5-S the corresponding MTTFs (1/λS and 1/λL) are about two hundred thousand hours [14]. The value of 1/δ is assumed to be twenty-four hours, which is actually an overestimate for TRIP. Disk repair times are expressed in days and MTTDLs in years.

From Figure 6, one can see that the MTTDL of TRIP5 is slightly worse than that of RAID5, by 30% for TRIP5-H and 10% for TRIP5-S on average. Both RAID5 and TRIP5 tolerate exactly one disk failure, but note that more disks in an array means a lower MTTDL; the MTTDL of TRIP5 therefore decreases because of the additional snapshot and log disks beyond the RAID disks. However, most deployments of RAID5 are motivated not only by high reliability but also by high performance, and sacrificing an insignificant amount of MTTDL is worthwhile for boosting I/O performance, especially for write-intensive applications. Besides, it is interesting that the MTTDL of a TRIP5-H system consisting of an 8-disk RAID5 volume, a snapshot disk, and a log disk is even higher than that of a regular RAID5 system with 10 disks.

Similar to TRIP5, we can also work out the MTTDLs of TRIP6 from its state transition diagram. The reliability model shows that an 8-disk regular RAID6 system has an average MTTDL of 60 years, while the MTTDLs of TRIP6-H and TRIP6-S are much lower, at just 8 and 679 years on average, respectively. However, their MTTDLs are still much better than that of RAID5, by up to 225% and 850% on average. Moreover, as mentioned before, the above analysis is an underestimate, since it only covers the parity-inconsistent state of TRIP. Even so, the analysis shows that the reliability of TRIP is only marginally affected; the overall MTTDL of TRIP should thus be comparable with that of a standard RAID system.

IV. PERFORMANCE EVALUATION

A. Experimental Setup

We have implemented a TRIP prototype in the Linux Software RAID framework. The performance evaluation is conducted on server-class hardware with an Intel Xeon 3.0 GHz processor and DDR memory. We use a Marvell MV88SX608 controller card to house 8 SATA disks; a separate IDE disk houses the operating system (the Linux kernel) and other software (MD and mdadm). The traces used in our experiments are obtained from the Storage Performance Council [15]. The two financial traces were collected from OLTP applications running at a large financial institution, and represent different access patterns in terms of write ratio, IOPS, and average request size, as shown in Figure 9. The trace replay tool is RAIDmeter [16], which replays block-level traces and is used to evaluate the user response time of storage systems.

B. Benchmark-Driven Evaluation Results

We first run IOmeter [19] with different access patterns against the different RAID architectures to measure their data transfer rates (MB/s). Each RAID set has 6 disks, and the stripe unit size is 64 KB. The RAID5 and RAID6 controllers are implemented in the Linux MD module, and the RAID6 coding algorithm uses the Reed-Solomon code [6].

From Figure 7(a), we can see that TRIP performs much better than RAID5 and RAID6 for random write requests: TRIP outperforms RAID5 and RAID6 by 244% and 406% on average, respectively. This is because TRIP eliminates the overhead of reading and writing the parity blocks that severely burdens its two RAID counterparts. On the other hand, TRIP performs slightly worse than RAID0, by 9% on average; this inferiority is mainly due to the additional overhead incurred by TRIP's snapshot and logging activities. For sequential write requests, as shown in Figure 7(b), TRIP still outperforms RAID5 and RAID6 by 229% and 284% on average, but the performance gap between TRIP and RAID0 widens. The reason is that the sequential write transfer rate is much higher than the random write transfer rate and, under the sequential write workload, the single log disk can become a performance bottleneck of the system as the transfer rate increases. One potential solution to this problem is to distribute the log capacity among the RAID disks. As for read performance, Figures 7(c) and 7(d) show that all four systems perform relatively closely in both the random and sequential environments. This is consistent with our intuition, in that there is no parity update or snapshot and logging activity for read requests.

Figure 7. Transfer rate comparison with different access patterns.

Figure 8. Response time comparison under the two financial traces.

Figure 9. The trace characteristics.

C. Trace-Driven Evaluation Results

We conduct the second experiment on the four systems driven by the two financial traces. Figure 8 shows their measured performance in terms of average response time. From Figure 8, we can see that the average response time of TRIP is lower (and thus better) than that of RAID5 and RAID6 by up to 22% and 33.5%, respectively, for Financial1, and by up to % and 9% for Financial2. Note that TRIP gains a larger performance improvement on Financial1 than on Financial2; the reason is that Financial1 has a higher write ratio than Financial2 (see Figure 9), and TRIP is more capable of handling write requests than the traditional parity-based RAID systems. Compared with RAID0, the average response time of TRIP is about 44% and 25% higher for the two traces, respectively. However, unlike TRIP, RAID0 has no ability to tolerate disk failures, so its deployment is quite limited in production environments.

V. RELATED WORK

Parity Logging. Stodolsky et al. [17] proposed a scheme called Parity Logging to eliminate the RAID5 small-write problem. Upon a write request to a data block, the new content of the data block is written in place, while the XOR result of the old and new content of the data block is recorded on a log disk using sequential logging. Thus, Parity Logging reduces the small-write penalty to two disk IO operations, namely, reading the old content of the data block from the disk and writing the new content of the data block to the disk. However, the parity-logging scheme has disadvantages compared with TRIP. Parity Logging records every intermediate state of the parity blocks; if a parity block is overwritten multiple times, the recovery chain can become relatively long. When it is time to apply the changes to the parity block, the system must track the log records one by one from the head of the log disk and restore every historical version of the parity block along the time sequence, even though only the latest version is useful. TRIP differs from Parity Logging in two significant ways. First, TRIP needs only one step to update a parity block to its latest version, so a significant amount of time and memory space can be saved. Second, when Parity Logging performs the delayed parity updates, the foreground write operations must be suspended, whereas, as shown in Figure 3, TRIP is able to perform the foreground and delayed parity update operations simultaneously, keeping the primary data always accessible.

AFRAID. AFRAID [18] tries to strike a good balance between performance and reliability for RAID5 systems. For certain periods, AFRAID stops parity updates and runs like RAID0 to provide high performance; the stale parity blocks are marked as dirty in an NVRAM-held bitmap, and during these periods the occurrence of one disk failure would lead to permanent data loss. In other periods, the dirty parity blocks are updated and the system returns to the normal RAID5 state.
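Returning to the Parity Logging comparison above, the following toy sketch (not the actual Parity Logging or TRIP code, and with invented record structures) contrasts replaying a chain of logged parity deltas with TRIP's single-step update from the snapshot and the current data.

```python
def apply_parity_log(parity, delta_records):
    """Parity-Logging style: fold every logged (old XOR new) delta, in arrival order."""
    for delta in delta_records:                      # one pass per intermediate update
        parity = bytes(p ^ d for p, d in zip(parity, delta))
    return parity

def trip_single_step(parity_old, data_old_from_snapshot, data_current):
    """TRIP style: one XOR brings the parity to its latest version."""
    return bytes(p ^ o ^ c for p, o, c in
                 zip(parity_old, data_old_from_snapshot, data_current))
```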

Through managing these two kinds of periods, AFRAID provides a controlled tradeoff between reliability and performance. The major drawback of AFRAID is that it cannot always recover from one disk failure the way a standard RAID5 system can.

Both Parity Logging and AFRAID maintain the data organization of a standard RAID system and simply delay or skip parity updates to boost performance. While TRIP shares a similar principle with Parity Logging and AFRAID, it employs several different techniques to ensure both performance and reliability. Beyond these schemes, there are also others such as Hot Mirroring [20] and Dynamic Striping [21]. Hot Mirroring uses mirroring for frequently updated (hot) data and parity logging for cold data. Dynamic Striping groups several small updates into a full-stripe write and writes it consecutively to a new place (i.e., without overwriting the original data blocks). Both schemes need to change the original data organization of the RAID system, which may lose the spatial locality of the data. Hu and Yang proposed the DCD (Disk Caching Disk) [22] scheme for disk-based storage systems. DCD uses an additional cache disk to collect small updates and propagates them to their original places later. Since the cache disk is accessed efficiently through large sequential operations, the performance of the entire system can be improved. However, the DCD scheme cannot be applied to RAID5 directly, because all the updates would be lost if the cache disk failed.

VI. CONCLUSION

In this paper, we present a new RAID architecture called TRIP. TRIP integrates and exploits temporal redundancy, by means of the snapshot and logging techniques that are commonly deployed in storage systems, to significantly alleviate the small-write penalty and boost performance for parity-based RAID systems. During write-intensive periods, TRIP delays parity updates and protects the disk array with temporal redundant data, allowing each small-write request to be serviced by as little as one disk write operation, with minimal overhead from the snapshot and logging activities. Thus, TRIP can greatly improve the performance of parity-based RAID systems while still guaranteeing their disk-failure tolerance.

ACKNOWLEDGEMENT

This work is supported by the 863 Project 2009AA0A40, 2009AA0A402, the National Basic Research 973 Program of China under Grant No. 20CB302300, 20CB30230, the Changjiang Innovative Group of Education of China No. IRT0725, and the US NSF under Grants CCF and IIS.

REFERENCES

[1] D. Patterson, G. Gibson, and R. Katz, "A Case for Redundant Arrays of Inexpensive Disks (RAID)," in Proc. of SIGMOD '88, June 1988.
[2] A. Azagury, M. E. Factor, J. Satran, and W. Micka, "Point-In-Time Copy: Yesterday, Today and Tomorrow," in Proc. of MSST '02, April 2002.
[3] M. Rosenblum and J. Ousterhout, "The design and implementation of a log-structured file system," ACM Transactions on Computer Systems, 10(1):26-52, February 1992.
[4] T. Chiueh and L. Huang, "Track-Based Disk Logging," in Proc. of DSN '02, June 2002.
[5] R. Golding, P. Bosch, C. Staelin, T. Sullivan, and J. Wilkes, "Idleness is not sloth," in Proc. of USENIX '95, January 1995.
[6] J. Plank, "A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems," Software: Practice and Experience, 27(9):995-1012, 1997.
[7] M. Blaum, J. Brady, J. Bruck, and J. Menon, "EVENODD: An optimal scheme for tolerating double disk failure in RAID architectures," IEEE Transactions on Computers, 44(2):192-202, 1995.
[8] C. Jin, H. Jiang, D. Feng, and L. Tian, "P-Code: A New RAID-6 Code with Optimal Properties," in Proc. of ICS '09, New York, NY, June 2009.
[9] C. Huang and L. Xu, "STAR: An efficient coding scheme for correcting triple storage node failures," in Proc. of FAST '05, San Francisco, December 2005.
[10] J. L. Hafner, "HoVer erasure codes for disk arrays," in Proc. of DSN '06, Philadelphia, June 2006.
[11] Q. Xin, E. L. Miller, T. Schwarz, D. D. E. Long, S. A. Brandt, and W. Litwin, "Reliability Mechanisms for Very Large Storage Systems," in Proc. of MSST '03, April 2003.
[12] J.-F. Pâris, T. Schwarz, and D. D. E. Long, "Self-Adaptive Two-Dimensional RAID Arrays," in Proc. of IPCCC '07, April 2007.
[13] B. Schroeder and G. A. Gibson, "Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?" in Proc. of FAST '07, February 2007.
[14] SanDisk Solid State Drive.
[15] OLTP Application I/O, UMass Trace Repository.
[16] L. Tian, D. Feng, H. Jiang, and K. Zhou, "PRO: A Popularity-based Multi-threaded Reconstruction Optimization for RAID-Structured Storage Systems," in Proc. of FAST '07, February 2007.
[17] D. Stodolsky, G. Gibson, and M. Holland, "Parity Logging: Overcoming the Small Write Problem in Redundant Disk Arrays," in Proc. of ISCA '93, May 1993.
[18] S. Savage and J. Wilkes, "AFRAID: A Frequently Redundant Array of Independent Disks," in Proc. of USENIX '96, January 1996.
[19] IOmeter.
[20] K. Mogi and M. Kitsuregawa, "Hot mirroring: A method of hiding parity update and degradation during rebuilds for RAID5," in Proc. of ACM SIGMOD '96, Montreal, Quebec, 1996.
[21] K. Mogi and M. Kitsuregawa, "Dynamic parity stripe reorganization for RAID5 disk arrays," in Proc. of the International Conference on Parallel and Distributed Information Systems, Austin, TX, September 1994.
[22] Y. Hu and Q. Yang, "DCD - Disk Caching Disk: A new approach for boosting I/O performance," in Proc. of ISCA '96, Philadelphia, PA, May 1996.


More information

Workload-Aware Elastic Striping With Hot Data Identification for SSD RAID Arrays

Workload-Aware Elastic Striping With Hot Data Identification for SSD RAID Arrays IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 36, NO. 5, MAY 2017 815 Workload-Aware Elastic Striping With Hot Data Identification for SSD RAID Arrays Yongkun Li,

More information

Lecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown

Lecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown Lecture 21: Reliable, High Performance Storage CSC 469H1F Fall 2006 Angela Demke Brown 1 Review We ve looked at fault tolerance via server replication Continue operating with up to f failures Recovery

More information

On the Speedup of Recovery in Large-Scale Erasure-Coded Storage Systems (Supplementary File)

On the Speedup of Recovery in Large-Scale Erasure-Coded Storage Systems (Supplementary File) 1 On the Speedup of Recovery in Large-Scale Erasure-Coded Storage Systems (Supplementary File) Yunfeng Zhu, Patrick P. C. Lee, Yinlong Xu, Yuchong Hu, and Liping Xiang 1 ADDITIONAL RELATED WORK Our work

More information

A Reliable B-Tree Implementation over Flash Memory

A Reliable B-Tree Implementation over Flash Memory A Reliable B-Tree Implementation over Flash Xiaoyan Xiang, Lihua Yue, Zhanzhan Liu, Peng Wei Department of Computer Science and Technology University of Science and Technology of China, Hefei, P.R.China

More information

LAST: Locality-Aware Sector Translation for NAND Flash Memory-Based Storage Systems

LAST: Locality-Aware Sector Translation for NAND Flash Memory-Based Storage Systems : Locality-Aware Sector Translation for NAND Flash Memory-Based Storage Systems Sungjin Lee, Dongkun Shin, Young-Jin Kim and Jihong Kim School of Information and Communication Engineering, Sungkyunkwan

More information

On the Speedup of Single-Disk Failure Recovery in XOR-Coded Storage Systems: Theory and Practice

On the Speedup of Single-Disk Failure Recovery in XOR-Coded Storage Systems: Theory and Practice On the Speedup of Single-Disk Failure Recovery in XOR-Coded Storage Systems: Theory and Practice Yunfeng Zhu, Patrick P. C. Lee, Yuchong Hu, Liping Xiang, and Yinlong Xu University of Science and Technology

More information

ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency

ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency Thanos Makatos, Yannis Klonatos, Manolis Marazakis, Michail D. Flouris, and Angelos Bilas {mcatos,klonatos,maraz,flouris,bilas}@ics.forth.gr

More information

Operating System Performance and Large Servers 1

Operating System Performance and Large Servers 1 Operating System Performance and Large Servers 1 Hyuck Yoo and Keng-Tai Ko Sun Microsystems, Inc. Mountain View, CA 94043 Abstract Servers are an essential part of today's computing environments. High

More information

FairCom White Paper Caching and Data Integrity Recommendations

FairCom White Paper Caching and Data Integrity Recommendations FairCom White Paper Caching and Data Integrity Recommendations Contents 1. Best Practices - Caching vs. Data Integrity... 1 1.1 The effects of caching on data recovery... 1 2. Disk Caching... 2 2.1 Data

More information

Caching and consistency. Example: a tiny ext2. Example: a tiny ext2. Example: a tiny ext2. 6 blocks, 6 inodes

Caching and consistency. Example: a tiny ext2. Example: a tiny ext2. Example: a tiny ext2. 6 blocks, 6 inodes Caching and consistency File systems maintain many data structures bitmap of free blocks bitmap of inodes directories inodes data blocks Data structures cached for performance works great for read operations......but

More information

IBM. Systems management Disk management. IBM i 7.1

IBM. Systems management Disk management. IBM i 7.1 IBM IBM i Systems management Disk management 7.1 IBM IBM i Systems management Disk management 7.1 Note Before using this information and the product it supports, read the information in Notices, on page

More information

V. Mass Storage Systems

V. Mass Storage Systems TDIU25: Operating Systems V. Mass Storage Systems SGG9: chapter 12 o Mass storage: Hard disks, structure, scheduling, RAID Copyright Notice: The lecture notes are mainly based on modifications of the slides

More information

Frequently asked questions from the previous class survey

Frequently asked questions from the previous class survey CS 370: OPERATING SYSTEMS [MASS STORAGE] Shrideep Pallickara Computer Science Colorado State University L29.1 Frequently asked questions from the previous class survey How does NTFS compare with UFS? L29.2

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Storage and Other I/O Topics I/O Performance Measures Types and Characteristics of I/O Devices Buses Interfacing I/O Devices

More information

Implementing a Statically Adaptive Software RAID System

Implementing a Statically Adaptive Software RAID System Implementing a Statically Adaptive Software RAID System Matt McCormick mattmcc@cs.wisc.edu Master s Project Report Computer Sciences Department University of Wisconsin Madison Abstract Current RAID systems

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

CS5460: Operating Systems Lecture 20: File System Reliability

CS5460: Operating Systems Lecture 20: File System Reliability CS5460: Operating Systems Lecture 20: File System Reliability File System Optimizations Modern Historic Technique Disk buffer cache Aggregated disk I/O Prefetching Disk head scheduling Disk interleaving

More information

Design, Implementation and Performance Evaluation of A Cost-Effective, Fault-Tolerant Parallel Virtual File System

Design, Implementation and Performance Evaluation of A Cost-Effective, Fault-Tolerant Parallel Virtual File System Design, Implementation and Performance Evaluation of A Cost-Effective, Fault-Tolerant Parallel Virtual File System Yifeng Zhu *, Hong Jiang *, Xiao Qin *, Dan Feng, David R. Swanson * * Department of Computer

More information

Building a High IOPS Flash Array: A Software-Defined Approach

Building a High IOPS Flash Array: A Software-Defined Approach Building a High IOPS Flash Array: A Software-Defined Approach Weafon Tsao Ph.D. VP of R&D Division, AccelStor, Inc. Santa Clara, CA Clarification Myth 1: S High-IOPS SSDs = High-IOPS All-Flash Array SSDs

More information

Reducing The De-linearization of Data Placement to Improve Deduplication Performance

Reducing The De-linearization of Data Placement to Improve Deduplication Performance Reducing The De-linearization of Data Placement to Improve Deduplication Performance Yujuan Tan 1, Zhichao Yan 2, Dan Feng 2, E. H.-M. Sha 1,3 1 School of Computer Science & Technology, Chongqing University

More information

VFS Interceptor: Dynamically Tracing File System Operations in real. environments

VFS Interceptor: Dynamically Tracing File System Operations in real. environments VFS Interceptor: Dynamically Tracing File System Operations in real environments Yang Wang, Jiwu Shu, Wei Xue, Mao Xue Department of Computer Science and Technology, Tsinghua University iodine01@mails.tsinghua.edu.cn,

More information

AOS: Adaptive Out-of-order Scheduling for Write-caused Interference Reduction in Solid State Disks

AOS: Adaptive Out-of-order Scheduling for Write-caused Interference Reduction in Solid State Disks , March 18-20, 2015, Hong Kong AOS: Adaptive Out-of-order Scheduling for Write-caused Interference Reduction in Solid State Disks Pingguo Li, Fei Wu*, You Zhou, Changsheng Xie, Jiang Yu Abstract The read/write

More information

Snapshot-Based Data Recovery Approach

Snapshot-Based Data Recovery Approach Snapshot-Based Data Recovery Approach Jaechun No College of Electronics and Information Engineering Sejong University 98 Gunja-dong, Gwangjin-gu, Seoul Korea Abstract: - In this paper, we present the design

More information

Stupid File Systems Are Better

Stupid File Systems Are Better Stupid File Systems Are Better Lex Stein Harvard University Abstract File systems were originally designed for hosts with only one disk. Over the past 2 years, a number of increasingly complicated changes

More information

CS370: Operating Systems [Spring 2017] Dept. Of Computer Science, Colorado State University

CS370: Operating Systems [Spring 2017] Dept. Of Computer Science, Colorado State University Frequently asked questions from the previous class survey CS 370: OPERATING SYSTEMS [MASS STORAGE] How does the OS caching optimize disk performance? How does file compression work? Does the disk change

More information

A New High-performance, Energy-efficient Replication Storage System with Reliability Guarantee

A New High-performance, Energy-efficient Replication Storage System with Reliability Guarantee A New High-performance, Energy-efficient Replication Storage System with Reliability Guarantee Jiguang Wan 1, Chao Yin 1, Jun Wang 2 and Changsheng Xie 1 1 Wuhan National Laboratory for Optoelectronics,

More information

Parallelizing Inline Data Reduction Operations for Primary Storage Systems

Parallelizing Inline Data Reduction Operations for Primary Storage Systems Parallelizing Inline Data Reduction Operations for Primary Storage Systems Jeonghyeon Ma ( ) and Chanik Park Department of Computer Science and Engineering, POSTECH, Pohang, South Korea {doitnow0415,cipark}@postech.ac.kr

More information

Implementing Software RAID

Implementing Software RAID Implementing Software RAID on Dell PowerEdge Servers Software RAID is an inexpensive storage method offering fault tolerance and enhanced disk read-write performance. This article defines and compares

More information

Fast Erasure Coding for Data Storage: A Comprehensive Study of the Acceleration Techniques. Tianli Zhou & Chao Tian Texas A&M University

Fast Erasure Coding for Data Storage: A Comprehensive Study of the Acceleration Techniques. Tianli Zhou & Chao Tian Texas A&M University Fast Erasure Coding for Data Storage: A Comprehensive Study of the Acceleration Techniques Tianli Zhou & Chao Tian Texas A&M University 2 Contents Motivation Background and Review Evaluating Individual

More information

DURING the last two decades, tremendous development

DURING the last two decades, tremendous development IEEE TRANSACTIONS ON COMPUTERS, VOL. 62, NO. 6, JUNE 2013 1141 Exploring and Exploiting the Multilevel Parallelism Inside SSDs for Improved Performance and Endurance Yang Hu, Hong Jiang, Senior Member,

More information

DiskReduce: RAID for Data-Intensive Scalable Computing (CMU-PDL )

DiskReduce: RAID for Data-Intensive Scalable Computing (CMU-PDL ) Research Showcase @ CMU Parallel Data Laboratory Research Centers and Institutes 11-2009 DiskReduce: RAID for Data-Intensive Scalable Computing (CMU-PDL-09-112) Bin Fan Wittawat Tantisiriroj Lin Xiao Garth

More information

SMORE: A Cold Data Object Store for SMR Drives

SMORE: A Cold Data Object Store for SMR Drives SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John Haskins Jr.*, James Kelley, David Slik, Keith A. Smith, and Maxim G. Smith Advanced Technology Group NetApp, Inc. * Qualcomm

More information

ZFS STORAGE POOL LAYOUT. Storage and Servers Driven by Open Source.

ZFS STORAGE POOL LAYOUT. Storage and Servers Driven by Open Source. ZFS STORAGE POOL LAYOUT Storage and Servers Driven by Open Source marketing@ixsystems.com CONTENTS 1 Introduction and Executive Summary 2 Striped vdev 3 Mirrored vdev 4 RAIDZ vdev 5 Examples by Workload

More information

Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM

Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM Hyunchul Seok Daejeon, Korea hcseok@core.kaist.ac.kr Youngwoo Park Daejeon, Korea ywpark@core.kaist.ac.kr Kyu Ho Park Deajeon,

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

Hot Block Clustering for Disk Arrays with Dynamic Striping

Hot Block Clustering for Disk Arrays with Dynamic Striping Hot Block Clustering for Disk Arrays with Dynamic Striping = exploitation of access locality and its performance analysis = Kazuhiko Mogi Masaru Kitsuregawa Institute of Industrial Science, The University

More information

ASEP: An Adaptive Sequential Prefetching Scheme for Second-level Storage System

ASEP: An Adaptive Sequential Prefetching Scheme for Second-level Storage System JOURNAL OF COMPUTERS, VOL. 7, NO. 8, AUGUST 2012 1853 : An Adaptive Sequential Prefetching Scheme for Second-level Storage System Xiaodong Shi Computer College, Huazhong University of Science and Technology,

More information

RAPID-Cache A Reliable and Inexpensive Write Cache for High Performance Storage Systems Λ

RAPID-Cache A Reliable and Inexpensive Write Cache for High Performance Storage Systems Λ RAPID-Cache A Reliable and Inexpensive Write Cache for High Performance Storage Systems Λ Yiming Hu y, Tycho Nightingale z, and Qing Yang z y Dept. of Ele. & Comp. Eng. and Comp. Sci. z Dept. of Ele. &

More information

1. Introduction. Traditionally, a high bandwidth file system comprises a supercomputer with disks connected

1. Introduction. Traditionally, a high bandwidth file system comprises a supercomputer with disks connected 1. Introduction Traditionally, a high bandwidth file system comprises a supercomputer with disks connected by a high speed backplane bus such as SCSI [3][4] or Fibre Channel [2][67][71]. These systems

More information

IBM InfoSphere Streams v4.0 Performance Best Practices

IBM InfoSphere Streams v4.0 Performance Best Practices Henry May IBM InfoSphere Streams v4.0 Performance Best Practices Abstract Streams v4.0 introduces powerful high availability features. Leveraging these requires careful consideration of performance related

More information

Low Latency Evaluation of Fibre Channel, iscsi and SAS Host Interfaces

Low Latency Evaluation of Fibre Channel, iscsi and SAS Host Interfaces Low Latency Evaluation of Fibre Channel, iscsi and SAS Host Interfaces Evaluation report prepared under contract with LSI Corporation Introduction IT professionals see Solid State Disk (SSD) products as

More information