TRIP: Temporal Redundancy Integrated Performance Booster for Parity-Based RAID Storage Systems

2010 16th International Conference on Parallel and Distributed Systems

TRIP: Temporal Redundancy Integrated Performance Booster for Parity-Based RAID Storage Systems

Chao Jin, Dan Feng, Hong Jiang, Lei Tian, Jingning Liu, Xiongzi Ge
Wuhan National Lab for Optoelectronics; School of Computer Science & Technology, Huazhong University of Science & Technology, Wuhan, China
Department of Computer Science & Engineering, University of Nebraska-Lincoln
chjinhust@gmail.com, dfeng@hust.edu.cn, j.n.liu@63.com, xiongzige@gmail.com, jiang@cse.unl.edu, tian@cse.unl.edu

Abstract: Parity redundancy is widely employed in RAID-structured storage systems to protect against disk failures. However, the small-write problem has been a persistent root cause of the performance bottleneck of such parity-based RAID systems, due to the additional parity update overhead incurred by each write operation. In this paper, we propose a novel RAID architecture, TRIP, built on conventional parity-based RAID systems. TRIP alleviates the small-write problem by integrating and exploiting the temporal redundancy (i.e., snapshots and logs) that commonly exists in storage systems to protect data from soft errors, while boosting write performance. During write-intensive periods, TRIP reduces the penalty of each small-write request to as little as one device IO operation, at a minimal cost of maintaining the temporal redundant information. Reliability analysis, in terms of Mean Time to Data Loss (MTTDL), shows that the reliability of TRIP is only marginally affected. On the other hand, our prototype implementation and performance evaluation demonstrate that TRIP significantly outperforms conventional parity-based RAID systems in data transfer rate and user response time, especially in write-intensive environments.

Keywords: small write; parity redundancy; temporal redundancy; snapshot; log; performance booster

I. INTRODUCTION

The Redundant Array of Independent Disks (RAID) [1] architecture has been popular in storage systems for decades, due largely to its major advantages of high performance and fault tolerance. Among the commonly used RAID levels, RAID0 provides the highest performance. However, it is vulnerable to disk failures, since it does not employ any data protection technique. By maintaining parity check information within the disk array, parity-based RAID systems (e.g., RAID5 or RAID6) can rebuild lost data after certain disk failures.

Parity-based RAID systems provide read performance comparable to RAID0. However, they suffer from severe write performance degradation, since they must additionally update the parity blocks upon each write operation. The situation worsens in write-intensive environments with small request sizes. Taking RAID5 as an example, when a small write request arrives for a data block, the old content of that data block and the corresponding parity block are first read from the disks into the buffer, then the new content of the parity block is generated by an XOR calculation on the fly, and finally the newly generated parity block and the new content of the data block are written to the disks. Thus, a write to a data block in a RAID5 system incurs as many as four expensive disk IO operations. This problem, called the small-write problem, severely degrades the performance of parity-based RAID systems.

In this paper, we propose a novel RAID architecture, called TRIP, built on traditional parity-based RAID systems.
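As a concrete illustration of this four-IO penalty, the following minimal Python sketch walks through the classic RAID5 read-modify-write path. The in-memory Disk class and the function names are illustrative stand-ins, not part of the TRIP prototype.

```python
class Disk:
    """Tiny in-memory stand-in for a block device (illustrative only)."""
    def __init__(self, nblocks, block_size=4096):
        self.blocks = [bytes(block_size) for _ in range(nblocks)]
    def read(self, lba):
        return self.blocks[lba]
    def write(self, lba, data):
        self.blocks[lba] = bytes(data)

def raid5_small_write(disks, data_disk, parity_disk, lba, new_data):
    """Classic RAID5 read-modify-write of one data block: 2 reads + 2 writes."""
    old_data = disks[data_disk].read(lba)        # IO 1: read old data block
    old_parity = disks[parity_disk].read(lba)    # IO 2: read old parity block
    # New parity follows from the XOR identity: P_new = P_old XOR D_old XOR D_new
    new_parity = bytes(p ^ d ^ n for p, d, n in zip(old_parity, old_data, new_data))
    disks[data_disk].write(lba, new_data)        # IO 3: write new data block
    disks[parity_disk].write(lba, new_parity)    # IO 4: write new parity block
    return 4                                     # small-write penalty in disk IOs
```

A RAID0 write of the same block would issue only the single data write, which is exactly the gap TRIP aims to close during write-intensive periods.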
TRIP integrates and exploits temporal redundancy, in the form of snapshots and logs, to overcome the small-write problem of parity-based RAID systems. Temporal redundancy [2] widely exists in storage systems to protect data from soft errors. In contrast to parity information, which we call spatial redundancy, temporal redundancy is a type of data redundancy that can restore files to a previous state along the timeline once data corruption or intrusion occurs.

The main idea behind TRIP is that, during write-intensive periods, the TRIP system suspends parity updates and operates like RAID0, while the disk array is temporarily protected by the temporal redundancy information stored in snapshots or logs. When the system becomes idle or lightly loaded, the delayed parity update operations are performed and the system returns to the normal state, in which the parity is consistent with the data. In most storage systems, snapshot and logging modules are already integrated as fundamental functionalities, so an implementation of TRIP can simply reuse these existing modules and thus requires minimal modifications to existing systems.

The main contributions of this paper include: (1) We reveal the fact that there usually exist overlaps between different forms of redundancy in storage systems; namely, temporal redundancy, if configured properly, can help recover from the disk failures that would normally be handled by spatial redundancy. (2) We propose the TRIP architecture to boost the write performance of parity-based RAID systems. TRIP employs the temporal redundancy generated by the snapshot and logging modules to protect the disk array from disk failures, and suspends parity update operations to accelerate writes during write-intensive periods. (3) We implement a TRIP prototype in the Linux software RAID framework, and evaluate its performance through extensive benchmark-driven and trace-driven experiments. Experimental results show that TRIP significantly increases the data transfer rate and decreases the request response time compared with conventional parity-based RAID systems.

The rest of the paper is structured as follows. The next section gives a detailed description of TRIP. We analyze the reliability of TRIP in Section III and evaluate the performance of TRIP through extensive experiments in Section IV. We review related work in Section V and conclude the paper in Section VI.

II. DESIGN AND IMPLEMENTATION OF TRIP

A. TRIP Architecture

The architecture of TRIP is shown in Figure 1, where, in addition to the standard RAID architecture, TRIP incorporates snapshot and logging modules into the system. The snapshot and logging modules are used to capture and exploit the temporal redundancy. A dedicated snapshot disk, outside the RAID disks, is configured to store the snapshot information. Similarly, a dedicated log disk is used to store the logging information.

Figure 1. TRIP Architecture.

Generally, TRIP can employ any of the existing, mature snapshot techniques; we select the Copy-On-Write (COW) technique out of a combined concern for performance and space overhead. Moreover, COW snapshots have been widely integrated into existing storage systems; for instance, Linux LVM provides a standard block-level COW snapshot function. TRIP can likewise use any of the existing logging techniques, such as the sequential logging used by LFS [3] or track-based disk logging [4]. We choose sequential logging since it is easy to implement and the space utilization of the log disk is higher than that of track-based logging. Note that TRIP is orthogonal to the specific snapshot or logging techniques, and thus it will benefit from more efficient snapshot or logging techniques in the future.

B. Process Flow

The basic idea behind TRIP is to postpone parity updates from IO-intensive periods to lightly loaded periods in order to accelerate writes. For reliability, TRIP exploits the temporal redundancy provided by the snapshot and log modules to protect against disk failures during the parity-inconsistent periods. TRIP operates in one of three states, denoted States A, B, and C. In State A, the system is idle or under light load, and runs like a standard RAID. In State B, the system is under heavy load, and parity update operations are suspended. State C is the transitional state from State B back to State A, in which the suspended parity update operations are performed and the system returns to the parity-consistent state. The snapshot and logging functions are activated in State B, and deactivated in States A and C. An idleness detector [5] is used to identify the system's busy and idle periods and to determine when to transition from one state to another. We assume NVRAM is used in the storage controller to protect the metadata structures against power or controller failures. The process flow of the TRIP system in the normal mode (i.e., in the absence of disk failures) is shown in Figure 2.

Figure 2. Normal Process Flow of TRIP.

In State C, the suspended parity update operations can be performed in either of two alternative approaches.

Reconstruction-write. A parity block is re-computed by XORing all the data blocks of the entire parity stripe; the temporal redundant data is not used.

Read-modify-write. For each data block that has been updated in State B (i.e., its original content has been copied to the snapshot), the parity block in the same parity stripe is updated through the following three steps. First, the original content of the data block and of its parity block are read from the snapshot. Second, the current content of the data block is read from the RAID. Third, the parity block is re-computed by XORing the three.

It must be noted that, in either approach, only the parity blocks in the modified stripes need to be re-computed.
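To make the two approaches concrete, here is a minimal Python sketch of how a stripe's parity could be brought back to consistency in State C. The helper names and data layout are illustrative assumptions rather than the prototype's code.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

def reconstruction_write(stripe_data):
    """Recompute the parity from all current data blocks of the stripe."""
    return xor_blocks(*stripe_data)

def read_modify_write(snapshot_old_parity, snapshot_old_data, current_data):
    """Fold one delayed update into the parity:
       P_new = P_old (snapshot) XOR D_old (snapshot) XOR D_current (RAID)."""
    return xor_blocks(snapshot_old_parity, snapshot_old_data, current_data)
```

In the read-modify-write case, the old data and old parity come from the snapshot records made in State B, so the stripe can be repaired without re-reading every surviving block.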

Besides the delayed parity updates, there may also be foreground write requests in State C. Fortunately, the foreground write operations do not compete with the delayed parity update operations, but cooperate with them. If the reconstruction-write approach is used for the foreground write operations, performing a foreground write on a parity stripe returns that stripe to the consistent state, and any delayed parity update operation on it can be cancelled. On the other hand, with the read-modify-write approach, writing a data block that is associated with a delayed parity update operation (i.e., a block that has been updated in State B but not yet in State C) follows three steps: (1) the original content of this data block is read from the snapshot instead of the RAID; (2) the original content of the parity block in this parity stripe is read from the snapshot; (3) the parity block is re-computed by XORing the original content of the data block, the original content of the parity block, and the current content of the data block (contained in the write request). After that, the delayed parity update operation associated with this data block can be cancelled. This process is illustrated in Figure 3. Note that writing a data block that is not associated with any delayed parity update operation follows the standard parity update procedure.

Figure 3. In State B, data blocks D1 and D2 are updated to D1' and D2'. Their old contents are recorded in the snapshot volume, and their new contents are logged on the log disk. The two records in the snapshot volume indicate two delayed parity updates. In State C, D1' is updated again to D1'', and the parity block is subsequently updated as P' = P XOR D1 XOR D1''. After that, the record of D1 in the snapshot volume is removed, and the delayed parity update associated with D1 is cancelled. In other words, the foreground parity update and the delayed parity update associated with one data block can be performed simultaneously through a single operation.

C. Small-Write Overhead Analysis

The primary goal of TRIP is to build a high-performance parity-based RAID system that weathers heavy write-intensive workloads. TRIP is based on traditional parity-based RAID architectures (e.g., RAID5, RAID6) and aims to alleviate the small-write problem inflicted on them. However, in order to maintain the disk-failure tolerance of parity-based RAID architectures, TRIP trades snapshot and logging overhead for high reliability and availability. Thus, the key question is: is the incurred snapshot and logging overhead justified by the reduced small-write penalty? In the following, we give a detailed analysis.

As mentioned before, a small write to one data block incurs four disk IO operations (two reads and two writes) for RAID5, and no fewer than six operations for RAID6. The TRIP system delays parity updates and runs as RAID0 during busy periods, so writing a data block incurs only one disk operation. This, however, comes at the additional cost of the snapshot and logging overhead. For the Copy-On-Write snapshot activity, if a data block happens to be overwritten multiple times due to access locality, the system copies its original data to the snapshot volume only the first time it is updated. This copy incurs only two disk operations: a read from the primary data (the RAID volume) and a write to the snapshot volume. Furthermore, we can configure a small write buffer to accumulate the writes to the snapshot volume and flush the data to the snapshot disk periodically as large sequential writes, so the penalty of the write operations to the snapshot volume can be mitigated. For the sequential logging activity, its penalty to the system is also negligible, as it is in LFS. In summary, the TRIP system incurs just one disk write and one infrequent disk read for each small-write operation, with a minimal cost of periodically writing the snapshot and log disks sequentially.

D. Exception Handling

In addition to the array of RAID disks, TRIP integrates a snapshot disk and a log disk into the system. It is these two disks that store the temporal redundant data and give TRIP the opportunity to delay the parity updates for performance improvement under write-intensive workloads. Thus, the state of these two disks is critical to the entire system, and any possible exception on them must be handled properly and immediately.

The first possible exception occurs when the log disk becomes full while the system is in State B (see Figure 2). In this case, the system should turn off the snapshot and logging modules immediately and switch to State C. However, the system may still be under an intensive workload, since this is an urgent state switch not guided by the idleness detector, and the concurrent normal parity updates and delayed parity updates may degrade system performance. Thus, to guarantee the foreground application performance, the parity updates incurred by foreground IOs are performed before the delayed parity updates, and the delayed parity update operations are further postponed until the idleness detector signals the arrival of an idle period. Fortunately, as mentioned in Section II-B, the normal parity updates cooperate with the delayed parity updates due to access locality. As a result, after performing the normal parity updates, many delayed parity update operations can be cancelled.

The second possible exception happens when the snapshot disk (but not the log disk) is full while the system is in State B. In this case, we can certainly handle it in the same manner as the first exception. However, there is a different, optimized way to handle it. Note that the system only needs to write the snapshot disk when a data block is overwritten for the first time. Thus, if a data block is being overwritten for the first time, the system simply executes the standard parity update procedure. Otherwise, it performs the regular State B procedure, namely, delaying the parity update and logging the data on the log disk. In this way, the system need not switch to State C until the log disk is full or the system becomes idle.
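The State B write path analyzed above (one in-place write, plus a copy-on-write record on the first overwrite and a sequential log append) can be summarized by the following Python sketch. The data structures, the stripe-layout constant, and the class name are illustrative assumptions, not the prototype's actual implementation.

```python
DATA_BLOCKS_PER_STRIPE = 7   # assumed layout: 8-disk RAID5, 7 data blocks + 1 parity

class TripStateB:
    """Simplified State B write path: parity updates are delayed while a COW
    snapshot and a sequential log provide the temporal redundancy."""

    def __init__(self, raid_blocks):
        self.raid = dict(raid_blocks)   # lba -> current content on the RAID volume
        self.snapshot = {}              # lba -> original content (COW records)
        self.log = []                   # append-only (lba, new content) records
        self.pending = set()            # stripes awaiting a delayed parity update

    def write(self, lba, new_data):
        # Copy-on-write: the original content is saved only on the first
        # overwrite, so repeated writes to a hot block cost one disk IO each.
        if lba not in self.snapshot:
            self.snapshot[lba] = self.raid[lba]          # one read plus one (batched) write
        self.log.append((lba, new_data))                 # cheap sequential log append
        self.raid[lba] = new_data                        # the single in-place data write
        self.pending.add(lba // DATA_BLOCKS_PER_STRIPE)  # remember the delayed update
```

When the idleness detector later switches the system to State C, the entries in pending drive the reconstruction-write or read-modify-write repairs sketched earlier, after which the snapshot and log records can be discarded.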

Generally, if the log disk is full, TRIP performs as well as a regular RAID system; if the snapshot disk is full, TRIP may still perform much better than a regular RAID system. We can also increase the capacities of the snapshot and log disks to make such exceptions less likely.

The third possible exception occurs when the snapshot disk, the log disk, or both fail while the system is in State B. Since the temporal redundant data is damaged, the primary RAID volume is left unprotected, and an arbitrary disk failure would lead to permanent data loss. Thus, in this case we have no choice but to perform an urgent state switch to State C. Furthermore, the delayed parity updates must be executed with higher priority using the reconstruction-write approach, while IO requests issued by foreground applications are suspended, to ensure that the system returns to the parity-consistent state as soon as possible. Generally, the third exception is more urgent than the first two, in that it may degrade the performance of the RAID system significantly and, more importantly, impose a window of vulnerability on the RAID. To make this exception less likely, we may use disks with lower failure rates (e.g., SSDs or PCM) as the snapshot and log disks. We may also consider mirroring (i.e., RAID1) to improve the reliability of the snapshot and logging modules, though it would incur a slightly higher write overhead.

E. Failure Recovery

When TRIP is in the parity-consistent state (i.e., State A in Figure 2), it can tolerate disk failures exactly as a regular RAID system does. When the data and parity are inconsistent, TRIP can still recover from disk failures with the help of the temporal redundant data. We denote the TRIP systems based on RAID5 and RAID6 as TRIP5 and TRIP6, respectively. First we show how TRIP5 can rebuild from one disk failure using the temporal redundant data; the case for TRIP6 is very similar.

For RAID5, the entire capacity is divided into multiple parity stripes across the component disks, with each disk contributing one equal-sized block to a parity stripe. Thus, rebuilding the failed disk is tantamount to rebuilding all the blocks on that disk. We now show that TRIP5 can recover in either of the following two failure cases.

Case 1: a disk failure occurring in State B. Consider an arbitrary block b on the failed disk. If b has been updated in State B, it can be directly rebuilt by copying its latest version from the log disk. Otherwise, if b has not been updated, it can be rebuilt from the snapshot: in the snapshot, the parity stripes are consistent, so block b can be rebuilt through the standard RAID rebuild procedure, namely, XORing all the remaining blocks in the same parity stripe.

Case 2: a disk failure occurring in State C. The snapshot and logging modules stop working in State C. As shown in Figure 4, the state of the snapshot is modified by each of the delayed update operations, so the modified snapshot is no longer the point-in-time image of the primary data.

Figure 4. Each record in the snapshot volume indicates a delayed parity update. When a delayed parity update is performed, the corresponding record in the snapshot volume is removed, i.e., the snapshot is modified.

The modified snapshot is defined as follows: in a parity stripe, if a block does not have a record in the snapshot volume, its current content on the RAID is regarded as part of the snapshot; otherwise, the record in the snapshot volume is regarded as part of the snapshot. It must be noted that the modified snapshot is always in a parity-consistent state. If the read-modify-write approach is used for the parity update operations, a given parity block may be un-updated, partly updated, or fully updated; in any of these cases, the modified snapshot is still parity-consistent. If the reconstruction-write approach is used, the parity block is always fully updated after an update operation.

Consider an arbitrary block b on the failed disk again. If b has been updated in State B (but not in State C), its current content can be found on the log disk, as in Case 1. Otherwise, if b has been updated in State C, or has not been updated in either state, it can be rebuilt from the modified snapshot, since the modified snapshot is parity-consistent. As we can see, with the help of the temporal redundancy, TRIP5 can always tolerate one disk failure even when it is in a parity-inconsistent state.

Similarly, TRIP6 maintains the disk-failure tolerance of a regular RAID6 system, namely, it can always rebuild from two concurrent disk failures in the disk array. Take the one-dimensional Reed-Solomon RAID6 system [6] as an example. For each Reed-Solomon parity stripe with n blocks, (n-2) data blocks are encoded into 2 parity blocks, and any two blocks in the parity stripe can be reconstructed from the other (n-2) blocks. Suppose two disk failures occur in a Reed-Solomon based TRIP6 system, and denote the two lost blocks in a Reed-Solomon parity stripe as b1 and b2. If both b1 and b2 have been updated in State B (but not in State C), they can be rebuilt directly by copying their latest versions from the log disk. If neither b1 nor b2 has been updated in State B, they can be rebuilt from the snapshot (in State B) or the modified snapshot (in State C). Finally, if only one of them (say, b1) has been updated in State B, then b1 can be rebuilt from the log disk, and b2 can then be rebuilt from the snapshot or modified snapshot.
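The recovery argument above can be summarized in a short Python sketch for the TRIP5 case. The dictionary-based representation of the log, snapshot, and surviving blocks is an illustrative assumption, not the on-disk format.

```python
def rebuild_block(lost_lba, stripe_lbas, surviving, log_records, snapshot):
    """Rebuild one lost block of a parity stripe while TRIP is parity-inconsistent.

    surviving:   lba -> current content of every readable block in the stripe
    log_records: lba -> latest content logged for blocks written in State B
    snapshot:    lba -> original content still recorded by the COW snapshot
                 (a record disappears once its delayed parity update is done)
    """
    # Block updated in State B with its delayed parity update still pending:
    # the latest version is simply copied back from the log disk.
    if lost_lba in log_records and lost_lba in snapshot:
        return log_records[lost_lba]

    # Otherwise rebuild from the (modified) snapshot view of the stripe, which
    # is parity-consistent: use the snapshot record where one exists, and the
    # current on-disk content otherwise, XORed over all other blocks.
    rebuilt = None
    for lba in stripe_lbas:
        if lba == lost_lba:
            continue
        block = snapshot.get(lba, surviving[lba])
        rebuilt = block if rebuilt is None else bytes(a ^ b for a, b in zip(rebuilt, block))
    return rebuilt
```

The same expression covers both Case 1 and Case 2, because the "modified snapshot" of a block degenerates to its current on-disk content whenever no snapshot record remains.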

As for two-dimensional parity-array RAID6 systems, such as EVENODD [7] and P-Code [8], their parity stripes can be regarded as combinations of one-dimensional RAID5 stripes along different slopes. When double disk failures occur, they can rebuild the lost data through a zigzag-style reconstruction process, with each step using a RAID5-like parity equation. Thus, a parity-array based TRIP6 system can also rebuild from two disk failures while in a parity-inconsistent state, and its reconstruction process is similar to that of the TRIP5 system. Moreover, we can also build TRIP systems on top of erasure-coded storage systems that tolerate more than two disk failures, such as STAR [9] and HoVer [10]. All such TRIP systems maintain the same disk-failure tolerance as the original RAID systems. Generally, the more disk failures a RAID system can tolerate, the more operational overhead a small write incurs, and thus the greater the performance advantage of the corresponding TRIP system over the original RAID system.

III. RELIABILITY ANALYSIS

In this section, we adopt the Mean Time to Data Loss (MTTDL) metric [11, 12] to estimate the reliability of TRIP. TRIP is as reliable as a regular RAID system when it is in the parity-consistent state, so in the following we only estimate the reliability of TRIP when it is in the parity-inconsistent state. Because of the fluctuating nature of real workloads, TRIP spends a significant portion of time in the parity-consistent state; thus, the overall MTTDLs of the TRIP systems should be much higher than the ones estimated below.

We assume that disk failures are independent events following an exponential distribution of rate λ, and that repairs follow an exponential distribution of rate μ. For simplicity, we do not consider latent sector errors in the system model, and leave this as future work. According to the standard result on the reliability of RAID5 [11, 12], the MTTDL of an 8-disk RAID5 is approximately:

    MTTDL_RAID5(8) ≈ μ / (56 λ²)                                          (1)

Figure 5 shows the state transition diagram for a TRIP5 system consisting of an 8-disk RAID5 set, a snapshot disk, and a log disk. We assume that snapshot disk failures and log disk failures follow exponential distributions of rates λS and λL, respectively. As mentioned in Section II-D, when the snapshot disk, the log disk, or both fail, the TRIP system must perform the delayed parity update operations immediately and return to the parity-consistent state; we assume that this completion time follows an exponential distribution of rate δ.

State <0> represents the normal state of the system, in which all the disks are operational. A failure of any of the 8 disks in the RAID set brings the system to State <1>, and a subsequent failure of any of the remaining 7 disks in the RAID set results in data loss.

Figure 5. State transition diagram for TRIP5.

The repair transition brings the system from State <1> back to State <0>. If the snapshot disk or the log disk fails in State <1>, data loss results. On the other hand, if the snapshot disk or the log disk fails in State <0>, the system moves to State <2>. In State <2>, if all the delayed parity update operations are performed successfully, the system goes back to State <0>; otherwise, if any of the 8 disks in the RAID set fails in State <2>, data loss results.

The Kolmogorov system of differential equations describing the behavior of the TRIP5 system is:

    dp0(t)/dt = -(8λ + λS + λL) p0(t) + μ p1(t) + δ p2(t)
    dp1(t)/dt = -(μ + 7λ + λS + λL) p1(t) + 8λ p0(t)                      (2)
    dp2(t)/dt = -(8λ + δ) p2(t) + (λS + λL) p0(t)

where pi(t) is the probability that the system is in State <i>, with the initial conditions p0(0) = 1 and pi(0) = 0 for i ≠ 0. The Laplace transform of Equation (2) is:

    s p0*(s) - 1 = -(8λ + λS + λL) p0*(s) + μ p1*(s) + δ p2*(s)
    s p1*(s) = -(μ + 7λ + λS + λL) p1*(s) + 8λ p0*(s)                     (3)
    s p2*(s) = -(8λ + δ) p2*(s) + (λS + λL) p0*(s)

Observing that the Mean Time to Data Loss of the system is given by [11]:

    MTTDL = Σi pi*(0) = p0*(0) + p1*(0) + p2*(0)                          (4)

we can work out, from Equations (3) and (4), the MTTDL of the TRIP5 system composed of an 8-disk RAID5 set, a snapshot disk, and a log disk. Since the resulting expression is rather complex (a ratio of two large polynomials), it is not displayed here; the computed values are presented in Figure 6 instead.
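Setting s = 0 in Equation (3) turns the Laplace-domain system into a small linear system whose solution, summed per Equation (4), gives the MTTDL. The NumPy sketch below illustrates this numerically; the rate values in the example call are placeholders chosen only to mirror the orders of magnitude discussed in this section, not the exact parameters behind Figure 6.

```python
import numpy as np

def trip5_mttdl(lam, mu, lam_s, lam_l, delta):
    """Solve Equation (3) at s = 0 and return MTTDL = p0* + p1* + p2* (Equation (4))."""
    # Rearranged as A @ [p0*, p1*, p2*] = [1, 0, 0]:
    A = np.array([
        [8 * lam + lam_s + lam_l, -mu,                           -delta          ],
        [-8 * lam,                 mu + 7 * lam + lam_s + lam_l,  0.0            ],
        [-(lam_s + lam_l),         0.0,                           8 * lam + delta],
    ])
    b = np.array([1.0, 0.0, 0.0])
    return np.linalg.solve(A, b).sum()

# Illustrative placeholder rates (per hour): 1/lambda = 50,000 h, 1/mu = 24 h,
# SSD snapshot/log disks with MTTF = 200,000 h, 1/delta = 24 h.
mttdl_hours = trip5_mttdl(lam=1/50_000, mu=1/24,
                          lam_s=1/200_000, lam_l=1/200_000, delta=1/24)
print(mttdl_hours / 8760, "years")
```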

Figure 6. MTTDL attained by different disk arrays.

Figure 6 plots MTTDL as a function of MTTR (Mean Time To Repair) for an 8-disk RAID5, a TRIP5-H system, and a TRIP5-S system. Each TRIP system is composed of an 8-disk RAID5 volume, a snapshot disk, and a log disk. TRIP5-H uses traditional hard disks (the same as the RAID disks) as the snapshot and log disks, while TRIP5-S uses solid state disks (SSDs). The RAID disk failure rate is assumed to be one failure every fifty thousand hours, in line with recent field studies [13]. The snapshot disk failure rate λS and the log disk failure rate λL are determined by the disks used: for TRIP5-H we assume λS = λL = λ, and for TRIP5-S the corresponding MTTFs (1/λS and 1/λL) are about two hundred thousand hours [14]. The value of 1/δ is assumed to be twenty-four hours, which is actually an overestimate for TRIP. Disk repair times are expressed in days and MTTDLs in years.

From Figure 6, one can see that the MTTDL of TRIP5 is slightly worse than that of RAID5, by 30% for TRIP5-H and 10% for TRIP5-S on average. Both RAID5 and TRIP5 tolerate exactly one disk failure, but note that more disks in an array means a lower MTTDL; the MTTDL of TRIP5 therefore decreases because of the additional snapshot and log disks beyond the RAID disks. However, most deployments of RAID5 are motivated not only by high reliability but also by high performance, and sacrificing an insignificant amount of MTTDL is worthwhile for boosting I/O performance, especially for write-intensive applications. Besides, it is interesting that the MTTDL of a TRIP5-H system consisting of an 8-disk RAID5 volume, a snapshot disk, and a log disk is even higher than that of a regular RAID5 system with 10 disks.

Similar to TRIP5, we can also work out the MTTDLs of TRIP6 from its state transition diagram. The reliability model shows that an 8-disk regular RAID6 system has an average MTTDL of 60 years, while the MTTDLs of TRIP6-H and TRIP6-S are much lower, at just 8 and 679 years on average, respectively. However, their MTTDLs are still much better than that of RAID5, by up to 225% and 850% on average. Moreover, as mentioned before, the above analysis is an underestimate, since it only covers the parity-inconsistent state of TRIP. Even so, the analysis shows that the reliability of TRIP is only marginally affected; the overall MTTDL of TRIP should thus be comparable with that of a standard RAID system.

IV. PERFORMANCE EVALUATION

A. Experimental Setup

We have implemented a TRIP prototype in the Linux Software RAID framework. The performance evaluation is conducted on server-class hardware with an Intel Xeon 3.0 GHz processor and DDR memory. We use a Marvell MV88SX608 controller card to house 8 SATA disks; a separate IDE disk houses the operating system (the Linux kernel) and other software (MD and mdadm). The traces used in our experiments are obtained from the Storage Performance Council [15]. The two financial traces were collected from OLTP applications running at a large financial institution, and represent different access patterns in terms of write ratio, IOPS, and average request size, as shown in Figure 9. The trace replay tool is RAIDmeter [16], which replays block-level traces and is used to evaluate the user response time of storage systems.

B. Benchmark-Driven Evaluation Results

We first run IOmeter [19] with different access patterns against the different RAID architectures to measure their data transfer rates (MB/s). Each RAID set has 6 disks, and the stripe unit size is 64 KB. The RAID5 and RAID6 controllers are implemented in the Linux MD module, and the RAID6 coding algorithm uses the Reed-Solomon code [6].

From Figure 7(a), we can see that TRIP performs much better than RAID5 and RAID6 for random write requests: TRIP outperforms RAID5 and RAID6 by 244% and 406% on average, respectively. This is because TRIP eliminates the overhead of reading and writing the parity blocks that severely burdens its two RAID counterparts. On the other hand, TRIP performs slightly worse than RAID0, by 9% on average; this inferiority is mainly due to the additional overhead incurred by TRIP's snapshot and logging activities. For sequential write requests, as shown in Figure 7(b), TRIP still outperforms RAID5 and RAID6 by 229% and 284% on average, but the performance gap between TRIP and RAID0 widens. The reason is that the sequential write transfer rate is much higher than the random write transfer rate and, under the sequential write workload, the single log disk can become a performance bottleneck of the system as the transfer rate increases. One potential solution to this problem is to distribute the log capacity among the RAID disks. As for read performance, Figures 7(c) and 7(d) show that all four systems perform relatively closely in both the random and sequential environments. This is consistent with our intuition, in that there is no parity update or snapshot and logging activity for read requests.

Figure 7. Transfer rate comparison with different access patterns.

Figure 8. Response time comparison under the two financial traces.

Figure 9. The trace characteristics.

C. Trace-Driven Evaluation Results

We conduct the second experiment on the four systems driven by the two financial traces. Figure 8 shows their measured performance in terms of average response time. From Figure 8, we can see that the average response time of TRIP is lower (and thus better) than that of RAID5 and RAID6 by up to 22% and 33.5%, respectively, for Financial1, and by up to % and 9% for Financial2. Note that TRIP gains a larger performance improvement on Financial1 than on Financial2; the reason is that Financial1 has a higher write ratio than Financial2 (see Figure 9), and TRIP is more capable of handling write requests than the traditional parity-based RAID systems. Compared with RAID0, the average response time of TRIP is about 44% and 25% higher for the two traces, respectively. However, unlike TRIP, RAID0 has no ability to tolerate disk failures, so its deployment is quite limited in production environments.

V. RELATED WORK

Parity Logging. Stodolsky et al. [17] proposed a scheme called Parity Logging to eliminate the RAID5 small-write problem. Upon a write request to a data block, the new content of the data block is written in place, while the XOR result of the old and new content of the data block is recorded on a log disk using sequential logging. Thus, Parity Logging reduces the small-write penalty to two disk IO operations, namely, reading the old content of the data block from the disk and writing the new content of the data block to the disk. However, the parity-logging scheme has disadvantages compared with TRIP. Parity Logging records every intermediate state of the parity blocks; if a parity block is overwritten multiple times, the recovery chain can become relatively long. When it is time to apply the changes to the parity block, the system must track the log records one by one from the head of the log disk and restore every historical version of the parity block along the time sequence, even though only the latest version is useful. TRIP differs from Parity Logging in two significant ways. First, TRIP needs only one step to update a parity block to its latest version, so a significant amount of time and memory space can be saved. Second, when Parity Logging performs the delayed parity updates, the foreground write operations must be suspended, whereas, as shown in Figure 3, TRIP is able to perform the foreground and delayed parity update operations simultaneously, keeping the primary data always accessible.

AFRAID. AFRAID [18] tries to strike a good balance between performance and reliability for RAID5 systems. For certain periods, AFRAID stops parity updates and runs like RAID0 to provide high performance; the stale parity blocks are marked as dirty in an NVRAM-held bitmap, and during these periods the occurrence of one disk failure would lead to permanent data loss. In other periods, the dirty parity blocks are updated and the system returns to the normal RAID5 state.
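Returning to the Parity Logging comparison above, the following toy sketch (not the actual Parity Logging or TRIP code, and with invented record structures) contrasts replaying a chain of logged parity deltas with TRIP's single-step update from the snapshot and the current data.

```python
def apply_parity_log(parity, delta_records):
    """Parity-Logging style: fold every logged (old XOR new) delta, in arrival order."""
    for delta in delta_records:                      # one pass per intermediate update
        parity = bytes(p ^ d for p, d in zip(parity, delta))
    return parity

def trip_single_step(parity_old, data_old_from_snapshot, data_current):
    """TRIP style: one XOR brings the parity to its latest version."""
    return bytes(p ^ o ^ c for p, o, c in
                 zip(parity_old, data_old_from_snapshot, data_current))
```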

Through managing these two kinds of periods, AFRAID provides a controlled tradeoff between reliability and performance. The major drawback of AFRAID is that it cannot always recover from one disk failure the way a standard RAID5 system can.

Both Parity Logging and AFRAID maintain the data organization of a standard RAID system and simply delay or skip parity updates to boost performance. While TRIP shares a similar principle with Parity Logging and AFRAID, it employs several different techniques to ensure both performance and reliability. Beyond these schemes, there are also others such as Hot Mirroring [20] and Dynamic Striping [21]. Hot Mirroring uses mirroring for frequently updated (hot) data and parity logging for cold data. Dynamic Striping groups several small updates into a full-stripe write and writes it consecutively to a new place (i.e., without overwriting the original data blocks). Both schemes need to change the original data organization of the RAID system, which may lose the spatial locality of the data. Hu and Yang proposed the DCD (Disk Caching Disk) [22] scheme for disk-based storage systems. DCD uses an additional cache disk to collect small updates and propagates them to their original places later. Since the cache disk is accessed efficiently through large sequential operations, the performance of the entire system can be improved. However, the DCD scheme cannot be applied to RAID5 directly, because all the updates would be lost if the cache disk failed.

VI. CONCLUSION

In this paper, we present a new RAID architecture called TRIP. TRIP integrates and exploits temporal redundancy, by means of the snapshot and logging techniques that are commonly deployed in storage systems, to significantly alleviate the small-write penalty and boost performance for parity-based RAID systems. During write-intensive periods, TRIP delays parity updates and protects the disk array with temporal redundant data, allowing each small-write request to be serviced by as little as one disk write operation, with minimal overhead from the snapshot and logging activities. Thus, TRIP can greatly improve the performance of parity-based RAID systems while still guaranteeing their disk-failure tolerance.

ACKNOWLEDGEMENT

This work is supported by the 863 Project 2009AA0A40, 2009AA0A402, the National Basic Research 973 Program of China under Grant No. 20CB302300, 20CB30230, the Changjiang Innovative Group of Education of China No. IRT0725, and the US NSF under Grants CCF and IIS.

REFERENCES

[1] D. Patterson, G. Gibson, and R. Katz, "A Case for Redundant Arrays of Inexpensive Disks (RAID)," in Proc. of SIGMOD '88, June 1988.
[2] A. Azagury, M. E. Factor, J. Satran, and W. Micka, "Point-In-Time Copy: Yesterday, Today and Tomorrow," in Proc. of MSST '02, April 2002.
[3] M. Rosenblum and J. Ousterhout, "The design and implementation of a log-structured file system," ACM Transactions on Computer Systems, 10(1):26-52, February 1992.
[4] T. Chiueh and L. Huang, "Track-Based Disk Logging," in Proc. of DSN '02, June 2002.
[5] R. Golding, P. Bosch, C. Staelin, T. Sullivan, and J. Wilkes, "Idleness is not sloth," in Proc. of USENIX '95, January 1995.
[6] J. Plank, "A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems," Software: Practice and Experience, 27(9):995-1012, 1997.
[7] M. Blaum, J. Brady, J. Bruck, and J. Menon, "EVENODD: An optimal scheme for tolerating double disk failure in RAID architectures," IEEE Transactions on Computers, 44(2):192-202, 1995.
[8] C. Jin, H. Jiang, D. Feng, and L. Tian, "P-Code: A New RAID-6 Code with Optimal Properties," in Proc. of ICS '09, New York, NY, June 2009.
[9] C. Huang and L. Xu, "STAR: An efficient coding scheme for correcting triple storage node failures," in Proc. of FAST '05, San Francisco, December 2005.
[10] J. L. Hafner, "HoVer erasure codes for disk arrays," in Proc. of DSN '06, Philadelphia, June 2006.
[11] Q. Xin, E. L. Miller, T. Schwarz, D. D. E. Long, S. A. Brandt, and W. Litwin, "Reliability Mechanisms for Very Large Storage Systems," in Proc. of MSST '03, April 2003.
[12] J.-F. Pâris, T. Schwarz, and D. D. E. Long, "Self-Adaptive Two-Dimensional RAID Arrays," in Proc. of IPCCC '07, April 2007.
[13] B. Schroeder and G. A. Gibson, "Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?" in Proc. of FAST '07, February 2007.
[14] SanDisk Solid State Drive.
[15] OLTP Application I/O, UMass Trace Repository.
[16] L. Tian, D. Feng, H. Jiang, and K. Zhou, "PRO: A Popularity-based Multi-threaded Reconstruction Optimization for RAID-Structured Storage Systems," in Proc. of FAST '07, February 2007.
[17] D. Stodolsky, G. Gibson, and M. Holland, "Parity Logging: Overcoming the Small Write Problem in Redundant Disk Arrays," in Proc. of ISCA '93, May 1993.
[18] S. Savage and J. Wilkes, "AFRAID: A Frequently Redundant Array of Independent Disks," in Proc. of USENIX '96, January 1996.
[19] IOmeter.
[20] K. Mogi and M. Kitsuregawa, "Hot mirroring: A method of hiding parity update and degradation during rebuilds for RAID5," in Proc. of ACM SIGMOD '96, Montreal, Quebec, 1996.
[21] K. Mogi and M. Kitsuregawa, "Dynamic parity stripe reorganization for RAID5 disk arrays," in Proc. of the International Conference on Parallel and Distributed Information Systems, Austin, TX, September 1994.
[22] Y. Hu and Q. Yang, "DCD - Disk Caching Disk: A new approach for boosting I/O performance," in Proc. of ISCA '96, Philadelphia, PA, May 1996.


More information

Workload-Aware Elastic Striping With Hot Data Identification for SSD RAID Arrays

Workload-Aware Elastic Striping With Hot Data Identification for SSD RAID Arrays IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 36, NO. 5, MAY 2017 815 Workload-Aware Elastic Striping With Hot Data Identification for SSD RAID Arrays Yongkun Li,

More information

Lecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown

Lecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown Lecture 21: Reliable, High Performance Storage CSC 469H1F Fall 2006 Angela Demke Brown 1 Review We ve looked at fault tolerance via server replication Continue operating with up to f failures Recovery

More information

On the Speedup of Recovery in Large-Scale Erasure-Coded Storage Systems (Supplementary File)

On the Speedup of Recovery in Large-Scale Erasure-Coded Storage Systems (Supplementary File) 1 On the Speedup of Recovery in Large-Scale Erasure-Coded Storage Systems (Supplementary File) Yunfeng Zhu, Patrick P. C. Lee, Yinlong Xu, Yuchong Hu, and Liping Xiang 1 ADDITIONAL RELATED WORK Our work

More information

A Reliable B-Tree Implementation over Flash Memory

A Reliable B-Tree Implementation over Flash Memory A Reliable B-Tree Implementation over Flash Xiaoyan Xiang, Lihua Yue, Zhanzhan Liu, Peng Wei Department of Computer Science and Technology University of Science and Technology of China, Hefei, P.R.China

More information

LAST: Locality-Aware Sector Translation for NAND Flash Memory-Based Storage Systems

LAST: Locality-Aware Sector Translation for NAND Flash Memory-Based Storage Systems : Locality-Aware Sector Translation for NAND Flash Memory-Based Storage Systems Sungjin Lee, Dongkun Shin, Young-Jin Kim and Jihong Kim School of Information and Communication Engineering, Sungkyunkwan

More information

On the Speedup of Single-Disk Failure Recovery in XOR-Coded Storage Systems: Theory and Practice

On the Speedup of Single-Disk Failure Recovery in XOR-Coded Storage Systems: Theory and Practice On the Speedup of Single-Disk Failure Recovery in XOR-Coded Storage Systems: Theory and Practice Yunfeng Zhu, Patrick P. C. Lee, Yuchong Hu, Liping Xiang, and Yinlong Xu University of Science and Technology

More information

ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency

ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency Thanos Makatos, Yannis Klonatos, Manolis Marazakis, Michail D. Flouris, and Angelos Bilas {mcatos,klonatos,maraz,flouris,bilas}@ics.forth.gr

More information

Operating System Performance and Large Servers 1

Operating System Performance and Large Servers 1 Operating System Performance and Large Servers 1 Hyuck Yoo and Keng-Tai Ko Sun Microsystems, Inc. Mountain View, CA 94043 Abstract Servers are an essential part of today's computing environments. High

More information

FairCom White Paper Caching and Data Integrity Recommendations

FairCom White Paper Caching and Data Integrity Recommendations FairCom White Paper Caching and Data Integrity Recommendations Contents 1. Best Practices - Caching vs. Data Integrity... 1 1.1 The effects of caching on data recovery... 1 2. Disk Caching... 2 2.1 Data

More information

Caching and consistency. Example: a tiny ext2. Example: a tiny ext2. Example: a tiny ext2. 6 blocks, 6 inodes

Caching and consistency. Example: a tiny ext2. Example: a tiny ext2. Example: a tiny ext2. 6 blocks, 6 inodes Caching and consistency File systems maintain many data structures bitmap of free blocks bitmap of inodes directories inodes data blocks Data structures cached for performance works great for read operations......but

More information

IBM. Systems management Disk management. IBM i 7.1

IBM. Systems management Disk management. IBM i 7.1 IBM IBM i Systems management Disk management 7.1 IBM IBM i Systems management Disk management 7.1 Note Before using this information and the product it supports, read the information in Notices, on page

More information

V. Mass Storage Systems

V. Mass Storage Systems TDIU25: Operating Systems V. Mass Storage Systems SGG9: chapter 12 o Mass storage: Hard disks, structure, scheduling, RAID Copyright Notice: The lecture notes are mainly based on modifications of the slides

More information

Frequently asked questions from the previous class survey

Frequently asked questions from the previous class survey CS 370: OPERATING SYSTEMS [MASS STORAGE] Shrideep Pallickara Computer Science Colorado State University L29.1 Frequently asked questions from the previous class survey How does NTFS compare with UFS? L29.2

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Storage and Other I/O Topics I/O Performance Measures Types and Characteristics of I/O Devices Buses Interfacing I/O Devices

More information

Implementing a Statically Adaptive Software RAID System

Implementing a Statically Adaptive Software RAID System Implementing a Statically Adaptive Software RAID System Matt McCormick mattmcc@cs.wisc.edu Master s Project Report Computer Sciences Department University of Wisconsin Madison Abstract Current RAID systems

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

CS5460: Operating Systems Lecture 20: File System Reliability

CS5460: Operating Systems Lecture 20: File System Reliability CS5460: Operating Systems Lecture 20: File System Reliability File System Optimizations Modern Historic Technique Disk buffer cache Aggregated disk I/O Prefetching Disk head scheduling Disk interleaving

More information

Design, Implementation and Performance Evaluation of A Cost-Effective, Fault-Tolerant Parallel Virtual File System

Design, Implementation and Performance Evaluation of A Cost-Effective, Fault-Tolerant Parallel Virtual File System Design, Implementation and Performance Evaluation of A Cost-Effective, Fault-Tolerant Parallel Virtual File System Yifeng Zhu *, Hong Jiang *, Xiao Qin *, Dan Feng, David R. Swanson * * Department of Computer

More information

Building a High IOPS Flash Array: A Software-Defined Approach

Building a High IOPS Flash Array: A Software-Defined Approach Building a High IOPS Flash Array: A Software-Defined Approach Weafon Tsao Ph.D. VP of R&D Division, AccelStor, Inc. Santa Clara, CA Clarification Myth 1: S High-IOPS SSDs = High-IOPS All-Flash Array SSDs

More information

Reducing The De-linearization of Data Placement to Improve Deduplication Performance

Reducing The De-linearization of Data Placement to Improve Deduplication Performance Reducing The De-linearization of Data Placement to Improve Deduplication Performance Yujuan Tan 1, Zhichao Yan 2, Dan Feng 2, E. H.-M. Sha 1,3 1 School of Computer Science & Technology, Chongqing University

More information

VFS Interceptor: Dynamically Tracing File System Operations in real. environments

VFS Interceptor: Dynamically Tracing File System Operations in real. environments VFS Interceptor: Dynamically Tracing File System Operations in real environments Yang Wang, Jiwu Shu, Wei Xue, Mao Xue Department of Computer Science and Technology, Tsinghua University iodine01@mails.tsinghua.edu.cn,

More information

AOS: Adaptive Out-of-order Scheduling for Write-caused Interference Reduction in Solid State Disks

AOS: Adaptive Out-of-order Scheduling for Write-caused Interference Reduction in Solid State Disks , March 18-20, 2015, Hong Kong AOS: Adaptive Out-of-order Scheduling for Write-caused Interference Reduction in Solid State Disks Pingguo Li, Fei Wu*, You Zhou, Changsheng Xie, Jiang Yu Abstract The read/write

More information

Snapshot-Based Data Recovery Approach

Snapshot-Based Data Recovery Approach Snapshot-Based Data Recovery Approach Jaechun No College of Electronics and Information Engineering Sejong University 98 Gunja-dong, Gwangjin-gu, Seoul Korea Abstract: - In this paper, we present the design

More information

Stupid File Systems Are Better

Stupid File Systems Are Better Stupid File Systems Are Better Lex Stein Harvard University Abstract File systems were originally designed for hosts with only one disk. Over the past 2 years, a number of increasingly complicated changes

More information

CS370: Operating Systems [Spring 2017] Dept. Of Computer Science, Colorado State University

CS370: Operating Systems [Spring 2017] Dept. Of Computer Science, Colorado State University Frequently asked questions from the previous class survey CS 370: OPERATING SYSTEMS [MASS STORAGE] How does the OS caching optimize disk performance? How does file compression work? Does the disk change

More information

A New High-performance, Energy-efficient Replication Storage System with Reliability Guarantee

A New High-performance, Energy-efficient Replication Storage System with Reliability Guarantee A New High-performance, Energy-efficient Replication Storage System with Reliability Guarantee Jiguang Wan 1, Chao Yin 1, Jun Wang 2 and Changsheng Xie 1 1 Wuhan National Laboratory for Optoelectronics,

More information

Parallelizing Inline Data Reduction Operations for Primary Storage Systems

Parallelizing Inline Data Reduction Operations for Primary Storage Systems Parallelizing Inline Data Reduction Operations for Primary Storage Systems Jeonghyeon Ma ( ) and Chanik Park Department of Computer Science and Engineering, POSTECH, Pohang, South Korea {doitnow0415,cipark}@postech.ac.kr

More information

Implementing Software RAID

Implementing Software RAID Implementing Software RAID on Dell PowerEdge Servers Software RAID is an inexpensive storage method offering fault tolerance and enhanced disk read-write performance. This article defines and compares

More information

Fast Erasure Coding for Data Storage: A Comprehensive Study of the Acceleration Techniques. Tianli Zhou & Chao Tian Texas A&M University

Fast Erasure Coding for Data Storage: A Comprehensive Study of the Acceleration Techniques. Tianli Zhou & Chao Tian Texas A&M University Fast Erasure Coding for Data Storage: A Comprehensive Study of the Acceleration Techniques Tianli Zhou & Chao Tian Texas A&M University 2 Contents Motivation Background and Review Evaluating Individual

More information

DURING the last two decades, tremendous development

DURING the last two decades, tremendous development IEEE TRANSACTIONS ON COMPUTERS, VOL. 62, NO. 6, JUNE 2013 1141 Exploring and Exploiting the Multilevel Parallelism Inside SSDs for Improved Performance and Endurance Yang Hu, Hong Jiang, Senior Member,

More information

DiskReduce: RAID for Data-Intensive Scalable Computing (CMU-PDL )

DiskReduce: RAID for Data-Intensive Scalable Computing (CMU-PDL ) Research Showcase @ CMU Parallel Data Laboratory Research Centers and Institutes 11-2009 DiskReduce: RAID for Data-Intensive Scalable Computing (CMU-PDL-09-112) Bin Fan Wittawat Tantisiriroj Lin Xiao Garth

More information

SMORE: A Cold Data Object Store for SMR Drives

SMORE: A Cold Data Object Store for SMR Drives SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John Haskins Jr.*, James Kelley, David Slik, Keith A. Smith, and Maxim G. Smith Advanced Technology Group NetApp, Inc. * Qualcomm

More information

ZFS STORAGE POOL LAYOUT. Storage and Servers Driven by Open Source.

ZFS STORAGE POOL LAYOUT. Storage and Servers Driven by Open Source. ZFS STORAGE POOL LAYOUT Storage and Servers Driven by Open Source marketing@ixsystems.com CONTENTS 1 Introduction and Executive Summary 2 Striped vdev 3 Mirrored vdev 4 RAIDZ vdev 5 Examples by Workload

More information

Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM

Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM Hyunchul Seok Daejeon, Korea hcseok@core.kaist.ac.kr Youngwoo Park Daejeon, Korea ywpark@core.kaist.ac.kr Kyu Ho Park Deajeon,

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

Hot Block Clustering for Disk Arrays with Dynamic Striping

Hot Block Clustering for Disk Arrays with Dynamic Striping Hot Block Clustering for Disk Arrays with Dynamic Striping = exploitation of access locality and its performance analysis = Kazuhiko Mogi Masaru Kitsuregawa Institute of Industrial Science, The University

More information

ASEP: An Adaptive Sequential Prefetching Scheme for Second-level Storage System

ASEP: An Adaptive Sequential Prefetching Scheme for Second-level Storage System JOURNAL OF COMPUTERS, VOL. 7, NO. 8, AUGUST 2012 1853 : An Adaptive Sequential Prefetching Scheme for Second-level Storage System Xiaodong Shi Computer College, Huazhong University of Science and Technology,

More information

RAPID-Cache A Reliable and Inexpensive Write Cache for High Performance Storage Systems Λ

RAPID-Cache A Reliable and Inexpensive Write Cache for High Performance Storage Systems Λ RAPID-Cache A Reliable and Inexpensive Write Cache for High Performance Storage Systems Λ Yiming Hu y, Tycho Nightingale z, and Qing Yang z y Dept. of Ele. & Comp. Eng. and Comp. Sci. z Dept. of Ele. &

More information

1. Introduction. Traditionally, a high bandwidth file system comprises a supercomputer with disks connected

1. Introduction. Traditionally, a high bandwidth file system comprises a supercomputer with disks connected 1. Introduction Traditionally, a high bandwidth file system comprises a supercomputer with disks connected by a high speed backplane bus such as SCSI [3][4] or Fibre Channel [2][67][71]. These systems

More information

IBM InfoSphere Streams v4.0 Performance Best Practices

IBM InfoSphere Streams v4.0 Performance Best Practices Henry May IBM InfoSphere Streams v4.0 Performance Best Practices Abstract Streams v4.0 introduces powerful high availability features. Leveraging these requires careful consideration of performance related

More information

Low Latency Evaluation of Fibre Channel, iscsi and SAS Host Interfaces

Low Latency Evaluation of Fibre Channel, iscsi and SAS Host Interfaces Low Latency Evaluation of Fibre Channel, iscsi and SAS Host Interfaces Evaluation report prepared under contract with LSI Corporation Introduction IT professionals see Solid State Disk (SSD) products as

More information