A Comprehensive Study on RAID-6 Codes: Horizontal vs. Vertical

2011 Sixth IEEE International Conference on Networking, Architecture, and Storage

Chao Jin, Dan Feng, Hong Jiang, Lei Tian
School of Computer Science & Technology, Huazhong University of Science & Technology
Wuhan National Laboratory for Optoelectronics
University of Nebraska-Lincoln
chjinhust@gmail.com, dfeng@hust.edu.cn, jiang@cse.unl.edu, ltian@hust.edu.cn

Abstract

The RAID-6 architecture is playing an increasingly important role in modern storage systems. There are generally two kinds of RAID-6 codes, horizontal codes and vertical codes. Horizontal codes have been extensively studied and widely implemented, while vertical codes have not gained equal attention. In this paper, we investigate the state-of-the-art horizontal and vertical RAID-6 codes and select two representative ones, RDP for horizontal codes and P-Code for vertical codes, to compare their performance. Since the code lengths of vertical codes are usually restricted, we first provide two efficient code shortening algorithms for vertical codes, by which the length of a vertical code can be extended to an arbitrary given one. In the context of our code shortening algorithms for vertical codes, we compare the theoretical performance of RDP and P-Code at consecutive lengths, and examine their practical behaviors in a real environment. Theoretical analysis and experimental evaluation results demonstrate that vertical codes can provide comparable, and sometimes even better, performance than horizontal codes.

Keywords: RAID-6; horizontal codes; vertical codes; code shortening; performance comparison

I. INTRODUCTION

The RAID (Redundant Array of Independent Disks) architecture [1] has been popular in storage systems for years, due largely to its two major advantages of high performance and fault tolerance.
RAID achieves high performance through parallel IO across an array of component disks, and provides fault tolerance through data redundancy and erasure codes. The latter, addressing the critical issue of data reliability and availability, has been drawing increasing attention from academia and industry lately. There are three major reasons behind this. First, recent findings from real-world studies have reported that partial or complete disk failure rates are actually much higher than previously and commonly estimated [2]. Second, while the number and capacity of disks have been growing exponentially, individual disk failure rates remain largely unchanged [3]. Third, disk failures usually show signs of correlation, e.g., after one disk fails, another disk failure will likely occur soon [2]. All these reveal that in today's data centers, where there are thousands of hard disks and two or more concurrent disk failures are no longer rare, the ability to tolerate multiple disk failures becomes ever more important. Among all the RAID levels, RAID-6 outperforms the others in disk failure tolerance due to its ability to tolerate any two concurrent disk failures in the array. Thus, many storage companies, as well as academic research groups, are conducting active research on RAID-6 codes. RAID-6 codes can be roughly divided into two categories, horizontal codes and vertical codes. Horizontal codes such as Reed-Solomon codes and EVENODD have been studied extensively and implemented widely in storage systems. Several open-source horizontal codes are introduced and evaluated in detail in [6], and their implementations can be obtained through open-source libraries like Jerasure [7]. Vertical codes, on the other hand, have gained less attention than their horizontal counterparts. Few applications of vertical codes have been seen in real environments, and their performance behaviors and properties remain largely unexplored.
In this paper, we examine the key properties of vertical RAID-6 codes and compare them comprehensively with those of horizontal RAID-6 codes. In particular, we choose the two most representative codes, RDP [8] for horizontal codes and P-Code [9] for vertical codes, for detailed and quantitative comparisons. We analyze their computational complexity, update complexity, and storage efficiency at all array sizes within the typical array size range. To reveal their behaviors in a real environment, we implement them on our storage platforms and evaluate their performance with IO benchmark tools. The main contributions of this paper are summarized as follows. We investigate the state-of-the-art horizontal and vertical RAID-6 codes, and select the two most representative ones, namely RDP for horizontal codes and P-Code for vertical codes, to compare them comprehensively. Horizontal codes are easy to extend by code shortening, while vertical codes are not so easy to shorten. We introduce two effective code shortening schemes for vertical codes, taking P-Code as an example. The two shortening schemes provide flexible design choices for P-Code. The first scheme maintains the MDS (i.e., optimal storage efficiency) property of P-Code, but may increase the computational complexity and update complexity. The second scheme loses the optimal storage efficiency, but attains the optimal update complexity and even lower computational complexity.

We analyze theoretically the key performance metrics of computational complexity, update complexity, and storage efficiency of RDP and P-Code at consecutive array sizes within the typical array size range, and show that P-Code provides comparable, and sometimes even better, performance than RDP. We discuss the design and implementation issues of RDP and P-Code in the context of practical implementations. We implement them on our storage platforms, and measure their performance under different design parameters in a real environment. We show that the design and implementation issues may have a significant impact on the performance of a RAID-6 array, and demonstrate that the theoretical performance analysis is in general consistent with the practical measurements.

The rest of the paper is organized as follows. The next section reviews the state-of-the-art horizontal and vertical codes. We discuss code shortening techniques and present two shortening schemes for vertical codes in Section 3. In Section 4 we analyze the key performance metrics of RDP and P-Code. Section 5 addresses the design and implementation issues that arise when implementing a RAID-6 code on real platforms. We measure and evaluate the performance of RDP and P-Code in practical implementations in Section 6 and conclude in Section 7.

II. REPRESENTATIVE RAID-6 CODES

The main difference between horizontal and vertical RAID-6 codes lies in the placement of the parity blocks inside their code structures. In horizontal codes, the last two columns are dedicated parity columns, and the other columns are data columns. The first parity column simply holds row parity blocks, and the second parity column is filled with parity blocks constructed via a code-specific algorithm. In vertical codes, on the other hand, there is no dedicated parity column; the data and parity blocks are spread across all the columns.

A. Horizontal RAID-6 Codes

Reed-Solomon code is a powerful general-purpose horizontal code [10]. It is suitable for any array size and can be generalized to tolerate an arbitrary number of disk failures. Its key shortcoming is that the second parity column is constructed via finite field arithmetic, which is computationally intensive. Many efforts have been made to reduce this computational complexity by proposing new codes on the basis of Reed-Solomon code, such as the Cauchy Reed-Solomon code [11]. Besides these general-purpose codes, there are special-purpose codes, which cannot easily be generalized to tolerate more disk failures. However, these special-purpose RAID-6 codes, such as EVENODD, RDP, and the Liberation codes [12], vastly outperform their general-purpose counterparts. We mainly focus on special-purpose codes in this paper for their ease of implementation and low computational complexity for RAID-6. In the following discussion, p represents a prime number.

EVENODD. A standard EVENODD code [5] has (p+2) columns and (p-1) rows. There are p diagonal parity chains across the data columns inside the code structure, each with a diagonal parity block. One of the diagonal parity blocks is XORed into the other (p-1) blocks as an adjusting factor, and these adjusted diagonal parity blocks are stored in the second parity column. When rebuilding from double disk failures, the original p diagonal parity blocks can be recovered. It must be noted, however, that the adjusting diagonal parity chain aggravates the computational complexity during construction and reconstruction, resulting in poor small-write performance.

RDP. A standard RDP code [8] has (p-1) rows and (p+1) columns. There are (p-1) diagonal parity chains across the data columns and the first parity column (the blocks that do not appear in any diagonal parity chain form the missing diagonal parity chain). The second parity column holds the (p-1) diagonal parity blocks.
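As a concrete illustration of the RDP construction just described, the two parity columns can be sketched as follows. This is our own sketch, not the paper's implementation: blocks are modeled as integers and the block-level XOR as integer XOR, and the function name `rdp_encode` is hypothetical.

```python
def rdp_encode(data, p):
    """Encode a standard RDP stripe of (p-1) rows x (p+1) columns, p prime.
    data: (p-1) x (p-1) matrix of blocks (ints; ^ models the block XOR).
    Column p-1 holds row parity; column p holds diagonal parity."""
    rows = p - 1
    arr = [list(r) + [0, 0] for r in data]
    for i in range(rows):
        # Row parity: XOR of the p-1 data blocks in row i.
        rp = 0
        for j in range(p - 1):
            rp ^= arr[i][j]
        arr[i][p - 1] = rp
    for d in range(p - 1):
        # Diagonal parity chain d: the blocks (i, j) with (i + j) mod p == d,
        # taken over the data columns and the row-parity column. Diagonal
        # p-1 is the "missing" diagonal and gets no parity block.
        dp = 0
        for j in range(p):
            i = (d - j) % p
            if i < rows:
                dp ^= arr[i][j]
        arr[d][p] = dp
    return arr
```

Note that each diagonal chain skips exactly one column, so every chain has p-1 members; the row-parity column participating in the diagonal chains is what makes double-erasure recovery possible.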
RDP is proven to perform the best among the horizontal RAID-6 codes [6].

Liberation code. A standard Liberation code [12] has p rows and (p+2) columns. The last two parity columns are constructed from the multiplication of a specified CDM (Coding Distribution Matrix) and the data vector. In fact, each horizontal RAID-6 code has a corresponding CDM. The CDM of the Liberation code has the minimal number of ones, meaning that it has the lowest update complexity among all the horizontal RAID-6 codes.

Table 1 presents a characteristic comparison of the abovementioned horizontal codes. The standard array sizes of a code refer to the array sizes (i.e., numbers of columns) at which the code is originally defined. A code usually performs the best at its standard array sizes. The brief ratings of computational complexity, update complexity, and storage efficiency in Table 1 all refer to the performance of the codes at their standard array sizes. A detailed comparison of the three codes can be found in [12]. All the horizontal codes are easy to extend (i.e., shorten) to arbitrary array sizes. Generally, RDP performs better than the other horizontal codes. Moreover, RDP has a clear geometrical structure, making it very easy to implement. Thus, we select RDP as the representative horizontal code to compare with vertical codes.

TABLE I. CHARACTERISTIC COMPARISON OF REPRESENTATIVE HORIZONTAL CODES

Horizontal Codes | Standard Array Size | Computational Complexity | Update Complexity | Storage Efficiency | Implementation Complexity | Extendibility
EVENODD          | prime+2             | high                     | high              | optimal            | easy                      | low
RDP              | prime, prime+1      | optimal                  | high              | optimal            | easy                      | low
Liberation Code  | prime+2             | low                      | low               | optimal            | easy                      | medium

B. Vertical RAID-6 Codes

For most vertical RAID-6 codes, each data block participates in the calculation of, and is protected by, exactly two parity blocks. Moreover, all the parity blocks are independent from one another. Thus, these codes attain the optimal update complexity and the optimal computational complexity during construction and reconstruction. Some representative vertical codes are X-Code [13], B-Code [14], and P-Code [9].

X-Code. X-Code has a structure of p rows and p columns. The data blocks, held in the first (p-2) rows, are covered by p diagonal parity chains along slope 1 and another p diagonal parity chains along slope -1. The parity blocks of the parity chains are stored in the last two rows.

B-Code. Xu et al. found an equivalence between the construction of a new kind of RAID-6 code, called B-Code, and the perfect one-factorization of complete graphs. The structure of B-Code consists of N columns, with N/2 rows when N is even or (N-1)/2 rows when N is odd. The existence of perfect one-factorizations for every complete graph with an even number of nodes is a famous conjecture in graph theory [15]. However, the conjecture has not been proved, and the possibility of constructing B-Code at an arbitrary array size N cannot be affirmed. One disadvantage of B-Code is that, inside its code structure, the data-parity patterns are not regular. So in a practical implementation, it might be necessary to use a table-driven technique that requires a large mapping table to store the data-parity information.

P-Code. A standard P-Code has (p-1)/2 rows and (p-1) columns. The first row holds the parity blocks, and the remaining (p-3)/2 rows hold data blocks. P-Code has a structure similar to B-Code's, except that its array size is limited to prime or (prime-1). However, P-Code is easy to extend.

TABLE II. CHARACTERISTIC COMPARISON OF REPRESENTATIVE VERTICAL CODES

Vertical Codes | Standard Array Size | Computational Complexity | Update Complexity | Storage Efficiency | Implementation Complexity | Extendibility
B-Code         | all (not proved)    | optimal                  | optimal           | optimal            | n/a                       | high
X-Code         | prime               | optimal                  | optimal           | optimal            | easy                      | low
P-Code         | prime-1, prime      | optimal                  | optimal           | optimal            | easy                      | low
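The regularity of P-Code's data-parity pattern can be made concrete with a short sketch. The placement rule below (a data block labeled {a, b} lies in column (a + b) mod p and belongs to parity chains P(a) and P(b)) is inferred from the block labels used in this paper's Figure 1 example; the function name `pcode_layout` and the dictionary representation are our own assumptions.

```python
def pcode_layout(p):
    """Lay out a standard P-Code with p-1 columns (p prime).
    Column i (1 <= i <= p-1) holds parity block P(i) in its first row,
    plus the (p-3)/2 data blocks labeled {a, b} with (a + b) mod p == i.
    A data block labeled {a, b} belongs to parity chains P(a) and P(b)."""
    cols = {i: {"parity": i, "data": []} for i in range(1, p)}
    for a in range(1, p):
        for b in range(a + 1, p):
            c = (a + b) % p
            if c != 0:  # pairs with a + b = 0 (mod p) are not used
                cols[c]["data"].append((a, b))
    return cols
```

For p = 7 this reproduces the 6-column structure discussed in Section 3: each column carries one parity block and (7-3)/2 = 2 data blocks (e.g., block (2,6) falls in column d1), and each parity chain P(x) covers p-3 = 4 data blocks.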
In Section 3, we will introduce two code shortening schemes for vertical codes, taking P-Code as an example. One distinctive difference between P-Code and B-Code is that P-Code uses a very simple and clear algorithm to describe the data-parity patterns in its code structure, so a mapping table like B-Code's is unnecessary in an implementation. A characteristic comparison of the abovementioned vertical codes is given in Table 2. All the vertical codes have the optimal computational complexity, update complexity, and storage efficiency at their corresponding standard array sizes. B-Code can support continuously all the array sizes within the typical disk-array-size range; however, it requires a mapping table that may complicate the implementation and adversely influence performance. X-Code and P-Code both have clear geometrical structures and are easy to implement. Moreover, since P-Code has only one parity block in each column, while X-Code has two, P-Code can be extended to an arbitrary array size more easily. We select P-Code as the representative vertical code to compare with horizontal codes.

III. CODE SHORTENING ALGORITHMS FOR VERTICAL RAID-6 CODES

It must be noted that special-purpose RAID-6 codes usually have array-size limitations; namely, the number of columns in the code structure must take certain discrete values (e.g., prime or near-prime). For instance, the array size of the standard RDP code is prime+1, and the array size of the standard P-Code is prime or prime-1. This restriction makes these codes somewhat impractical in a real environment, since an administrator may want to configure a disk array with an arbitrary number of disks. Fortunately, through code shortening, both horizontal and vertical codes can be extended to arbitrary array sizes. Shortening horizontal codes is straightforward.
For a standard horizontal code, we can simply remove some of the data columns from its code structure, by assuming that the removed data columns contain only imaginary zeros. It has been shown in [8] that RDP can be configured for any array size in this way. Vertical codes, on the other hand, are not so easy to shorten. In the structure of a vertical code, each column usually contains not only data blocks but also parity blocks. The problem is that, if we remove a column, the parity blocks in that column are also removed, and the corresponding parity chains may then be left in an inconsistent state. Generally, there are two algorithms to handle this problem. We illustrate them by taking P-Code as an example; note that the algorithms are also applicable to other vertical RAID-6 codes such as X-Code [16].

Figure 1 illustrates the two shortening schemes for P-Code (dn denotes the n-th column).

Figure 1. The two shortening schemes for P-Code.

The first scheme is inspired by the method proposed in [17]. As shown in Figure 1b, in the structure of the standard P-Code with 6 columns (Figure 1a), when the parity block (6) is removed with column d6, we select a data block in the same parity chain (e.g., data block (2,6) in column d1) to be the new parity block. The data-parity pattern of the remaining columns is the same as before, namely, the labels of the remaining blocks stay unchanged. The remaining structure is a shortened P-Code with 5 columns. Obviously, the shortened P-Code can also be rebuilt from any two column erasures, and its reconstruction algorithm is very similar to that of a standard P-Code. We can further shorten P-Code by removing more columns in this way.

The second scheme is to remove the entire parity chain whose parity block has been removed. As shown in Figure 1c, when the parity block (6) is removed with column d6, we remove all the data blocks in parity chain P(6) (i.e., the data blocks whose labels contain the integer 6). Note that there is no data block of P(6) in column d5, so in order to maintain an equal number of rows in each column, we remove block (2,3) additionally. When rebuilding, we simply imagine that the removed blocks still exist but contain zero values.

The resulting structures of the two shortening schemes have different properties. For the first scheme, the number of parity blocks and the number of rows in each column remain unchanged after shortening. A desirable property for a RAID-6 code is the Maximum-Distance-Separable (MDS) property, which assures optimal storage efficiency (see Section 4.3). The necessary and sufficient condition for a RAID-6 code to be MDS is that the number of parity blocks in its code structure equals exactly twice the number of rows in each column. It is easy to see that the shortened P-Code is still an MDS code, namely, it has the optimal storage efficiency. However, the update complexity of the shortened P-Code becomes non-optimal. In Figure 1b, the new parity block of the parity chain P(6), namely block (2,6), also participates in the parity chain P(2).
When a data block in the parity chain P(6), say block (3,6), is updated, the parity block (3) and block (2,6) must be updated, and since (2,6) is updated, the parity block (2) must also be updated. Thus, the update complexity of the data blocks in the parity chain P(6) is 3, above the optimal of 2. At the same time, the computational complexity of the shortened P-Code is no longer optimal; we examine it in detail in Section 4. For the second scheme, the number of parity blocks and the number of rows in each column are reduced after shortening. Generally, each time we remove one more column from the structure of P-Code, the number of parity blocks and the number of rows per column are each reduced by one. P-Code shortened in this way is no longer MDS, since it does not satisfy the aforementioned condition. However, this scheme is still suitable for a practical implementation, for several reasons. First, the standard P-Code covers a major part of the typical array size range (e.g., 4-37 disks), and the interval between two standard P-Code array sizes is relatively small, so the shortening from the nearest standard P-Code will not be significant, and the storage efficiency of the shortened P-Code will be within an acceptable factor of the optimal. Second, the capacity of modern disks keeps growing at a steady but fast rate, making storage efficiency less of a concern for a storage system administrator. Third, the construction and reconstruction computational complexity of this shortened P-Code is even lower than that of an MDS code with the same array size (see Section 4.1). And finally, the shortened P-Code still has the optimal update complexity of 2 (see Section 4.2).

IV. THEORETICAL PERFORMANCE METRICS

In this section, we analyze the key theoretical performance metrics of computational complexity, update complexity, and storage efficiency of RDP and P-Code based on their code structures.
We examine these performance properties at all array sizes within the typical array size range, including those obtained by code shortening.

A. Computational Complexity

It has been proven in [9] that the optimal construction complexity for any MDS RAID-6 code, in terms of the average number of XOR operations per data block, is 2 - 2/(n-2), and the optimal reconstruction complexity, in terms of the average number of XOR operations per lost block regeneration, is (n-3), where n is the array size. Among the aforementioned horizontal codes, all the codes except RDP fail to attain the optimal construction complexity [9]. RDP performs optimally when its array size n is prime or prime+1 [12]. However, shortened RDP loses this optimality, i.e., its construction complexity is above the optimal. Suppose that the standard RDP has p columns (p is prime). From the structure of RDP we can see that, when we shorten RDP by one column, each of the p-1 horizontal parity chains, and p-2 out of the p-1 diagonal parity chains (i.e., all except the missing diagonal parity chain), are shortened by one data block. Thus the total number of XOR operations needed in the construction process is reduced by 2p-3, and the total number of data blocks is reduced by p-1. The construction computational complexity of the shortened RDP is therefore

    [2(p-1)(p-3) - N(2p-3)] / [(p-1)(p-2) - N(p-1)]    (1)

In the above expression, N is the number of shortened columns (i.e., the array size n is p-N). Similarly, the reconstruction computational complexity of the shortened RDP is

    [2(p-1)(p-3) - N(2p-3)] / [2(p-1)]    (2)

On the other hand, each of the aforementioned vertical codes attains its optimal computational complexity at its corresponding standard array sizes. P-Code attains the optimal computational complexity at the array size of prime or prime-1. However, the shortened P-Code behaves differently under the two shortening schemes. Suppose that the standard P-Code has p-1 columns. For the first shortening scheme, when we shorten P-Code by one column, as shown in Figure 1b, p-2 out of the p-1 parity chains are each shortened by one data block, so the total number of XOR operations is reduced by p-2, and the total number of data blocks is reduced by (p-1)/2. The construction computational complexity of the P-Code shortened by the first scheme is thus given in Expression (3) below.

    [(p-1)(p-4) - N(p-2)] / [(p-1)(p-3)/2 - N(p-1)/2]    (3)

As in Expression (1), N is the number of shortened columns (i.e., the array size n is p-1-N). The reconstruction computational complexity of the P-Code shortened by the first scheme is given in Expression (4).

    [(p-1)(p-4) - N(p-2)] / (p-1)    (4)

For the second shortening scheme, when the standard P-Code with p-1 columns is shortened by one column, as shown in Figure 1c, the data blocks in the last column are removed first, and each remaining column removes one data block additionally. Thus, when the standard P-Code is shortened by N columns, the total number of data blocks removed from its code structure, denoted R, is given in Equation (5) below.

    R = N(p-3)/2 + N(p-1-N)    (5)

Since each data block participates in exactly two parity chains, removing one data block saves two XORs in the construction process. Thus the total number of XOR operations needed in the construction process is reduced by 2R, and the construction computational complexity of the P-Code shortened by the second scheme is shown in Expression (6).

    [(p-1)(p-4) - 2R] / [(p-1)(p-3)/2 - R]    (6)

Similarly, the reconstruction computational complexity of the P-Code shortened by the second scheme is given in Expression (7).

    [(p-1)(p-4) - 2R] / (p-1-N)    (7)

The construction and reconstruction computational complexity of RDP and P-Code, normalized to the optimal complexity of MDS codes, at all array sizes within the typical array size range are shown in Figure 2 and Figure 3 respectively. "P-Code first/second" stands for P-Code with the first/second shortening scheme respectively.

Figure 2. Normalized construction computational complexity for RDP and P-Code.

Figure 3. Normalized reconstruction computational complexity for RDP and P-Code.

From the figures we can see that all three codes perform optimally at their corresponding standard array sizes. The computational complexity of RDP and of the shortened P-Code with the first scheme stays very close to the optimal, within only a small factor of it. The computational complexity of the shortened P-Code with the second scheme is below the optimal, and the gap widens as more columns are shortened.
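Expression (1) can be checked numerically against the MDS optimum 2 - 2/(n-2). The sketch below is ours (the helper names and the p = 13 example are not from the paper):

```python
def rdp_construction_xors_per_block(p, N):
    """Average XORs per data block of an RDP array shortened by N columns
    from a standard p-column RDP (p prime), per Expression (1)."""
    return (2 * (p - 1) * (p - 3) - N * (2 * p - 3)) / \
           ((p - 1) * (p - 2) - N * (p - 1))

def optimal_construction(n):
    """Optimal construction complexity of an MDS RAID-6 code of size n."""
    return 2 - 2 / (n - 2)

# Example: shorten a 13-column RDP to a 10-disk array (N = 3) and compare
# its construction cost with the optimum at n = 10.
p, N = 13, 3
ratio = rdp_construction_xors_per_block(p, N) / optimal_construction(p - N)
```

Note that with N = 0 the expression reduces exactly to 2 - 2/(p-2), i.e., the optimum at n = p; shortening is what introduces the (small) gap plotted in Figure 2.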

B. Update Complexity

In an erasure-coded disk array, when a data block is updated, the associated parity blocks must also be updated to maintain parity consistency. The update complexity indicates how many parity blocks are associated with a data block on average. The lower the update complexity, the smaller the write penalty incurred. It is known that the optimal (i.e., lowest) update complexity for a RAID-6 code is 2. However, horizontal codes are unable to attain this optimal bound. The reason is that, in their code structures, either the parity blocks are not independent from one another (e.g., RDP), or a data block is associated with more than two parity blocks on average (e.g., EVENODD). It is shown in [12] that the Liberation codes attain the lowest update complexity among all the horizontal RAID-6 codes, but it is still larger than 2. In the code structure of RDP, the data blocks in the first row parity chain and the missing diagonal parity chain have an update complexity of 2, and the others have an update complexity of 3. It is easy to see that the standard RDP with p+1 columns has the following average update complexity.

    [2(2p-3) + 3((p-1)^2 - (2p-3))] / (p-1)^2    (8)

For the shortened RDP, in each data column, two out of the p-1 data blocks have an update complexity of 2, and the other p-3 have an update complexity of 3. Thus the average update complexity of the shortened RDP can be represented by Expression (9).

    [4 + 3(p-3)] / (p-1)    (9)

Generally, vertical codes outperform their horizontal counterparts in update complexity. All the aforementioned vertical codes have the lowest update complexity of 2 at their corresponding standard array sizes. P-Code attains this optimality at the array sizes of p and p-1, where p is prime. As for the shortened P-Code, it is easy to see that P-Code shortened with the second scheme still has the optimal update complexity of 2.
However, P-Code shortened with the first scheme no longer has the optimal update complexity; we explained the reason with an example in Section 3. It is hard to give a closed-form expression for the update complexity of P-Code shortened with the first scheme, so we have worked it out manually for all the array sizes from 4 to 37 columns. Figure 4 plots the update complexity of RDP and P-Code within the typical array size range. As analyzed above, P-Code with the second shortening scheme always has the optimal update complexity. The update complexity of RDP is the highest of the three; moreover, it increases with the array size, approaching an asymptotic value of 3. The shortened P-Code with the first scheme has the optimal update complexity at its standard array sizes, and its update complexity increases as more columns are shortened.

Figure 4. Update complexity for RDP and P-Code.

Figure 5. Normalized storage efficiency for RDP and P-Code.

C. Storage Efficiency

Storage efficiency measures the percentage of data blocks in the code structure of an erasure code. The Singleton bound [18] gives the optimal storage efficiency of any kind of erasure code. A code that attains the Singleton bound is called a Maximum-Distance-Separable (MDS) code, and its storage efficiency is optimal. As discussed before, there is a simple and convenient way to decide whether a RAID-6 code is MDS: in the structure of an MDS RAID-6 code, the number of parity blocks equals exactly twice the number of rows in each column. Thus, the optimal storage efficiency for a RAID-6 code with array size n is (n-2)/n. It is easy to see that the standard or shortened RDP is always an MDS code. The standard P-Code and the P-Code shortened with the first scheme are also MDS codes. The P-Code shortened with the second scheme is not an MDS code.
Suppose it is shortened from the standard P-Code with p-1 columns by N columns; it is easy to see that there are (p-1)/2 - N rows left in its structure, with the first row holding parity blocks. Thus its storage efficiency is given in Expression (10) below.

    [(p-1)/2 - N - 1] / [(p-1)/2 - N]    (10)

The storage efficiency of RDP and P-Code, normalized to the optimal storage efficiency of MDS codes, is shown in Figure 5. We can see that RDP and P-Code with the first shortening scheme always have the optimal storage efficiency. The storage efficiency of the shortened P-Code with the second scheme is not optimal, and it decreases as more columns are shortened from the standard array size. However, it remains within an acceptable factor (e.g., 96%) of the optimal most of the time.

V. DESIGN AND IMPLEMENTATION ISSUES

So far, we have discussed the theoretical performance metrics of RAID-6 codes based on their code structures. When implementing RAID-6 codes in real storage systems, practical matters, such as the effects of cache and memory, the interactions with the file system, and the characteristics of the IO workload, can influence the performance of a disk array significantly. In this section, we discuss the design issues that must be considered in a practical implementation.

A. Single-Stripe vs. Multiple-Stripe Implementation

Due to the memory management mechanism of operating systems such as Linux, main memory is structured into pages, whose size is usually an integral power of two bytes (e.g., 4KB). A stripe is an IO buffer that caches a codeword (i.e., a parity group of disk blocks) in memory. A stripe is composed of n strips, where n is the array size, with each strip corresponding to a disk/column. In a single-stripe implementation, each strip consists of a single memory page; in a multiple-stripe implementation, each strip consists of multiple memory pages. For a single-stripe implementation of a RAID-6 code, the number of rows in its code structure must be an integral power of two, due to the size of the memory pages. Multiple-stripe implementations do not have this restriction.
For instance, a block in a column of the codeword can be directly mapped to a memory page of a strip in the stripe cache. For Reed-Solomon-like codes, since there is just one row in their code structures, a single-stripe implementation is the natural choice. Special-purpose RAID-6 codes, such as RDP and P-Code, can also be implemented in single-stripe mode at all array sizes. For instance, by selecting p=257, RDP can be implemented in single-stripe mode directly, since it then has 256 rows. Note that shortening RDP does not change the number of rows in its code structure, so all the array sizes below 257 (and above the next smallest such prime) can be implemented on the basis of p=257 through code shortening. P-Code can be implemented in a similar way via the first shortening scheme. Moreover, as a vertical code, P-Code offers another way to implement single-stripe mode: vertical shortening. For instance, when p=11, the number of rows in P-Code's structure is 5, which is not a power of 2 and is hence unsuitable for a single-stripe implementation. However, we can remove the last row by assuming that it contains only zeros, and the remaining structure has 4 rows, satisfying the condition. The advantage of a single-stripe implementation is that, since an individual stripe cache is small (i.e., a single memory page per strip), more stripe caches may be available when the total amount of memory cache is limited. This implementation mode also allows us to reuse the existing software and techniques for reading and writing a single stripe, thus simplifying the implementation [8]. However, nearly all array sizes, except for a few isolated ones, cannot be implemented in single-stripe mode directly, but must be obtained through code shortening. For instance, if we want to implement a 20-disk RDP array, we have no choice but to shorten from a 257-disk RDP array, since 257 is the smallest prime above 20 of the form 2^n + 1.
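The search for a suitable p can be sketched as follows. This is our own helper (the names `is_prime` and `single_stripe_rdp_prime` are assumptions); the constraint it encodes is that RDP's p-1 rows must be a power of two so that each strip maps onto whole memory pages.

```python
def is_prime(n):
    """Trial-division primality test; adequate for the small p used here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def single_stripe_rdp_prime(n_disks):
    """Smallest prime p >= n_disks with p - 1 a power of two, i.e., the
    smallest standard RDP one can shorten from in single-stripe mode."""
    k = 1
    while True:
        p = 2 ** k + 1
        if p >= n_disks and is_prime(p):
            return p
        k += 1
```

For a 20-disk array this yields 257, matching the discussion above: 33, 65, and 129 are composite, so 257 = 2^8 + 1 is the first usable prime above 20.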
As shown in Section 4, shortening a large number of columns may severely harm the performance of the disk array. In contrast, the multiple-stripe mode, while a little more complicated to implement, can be applied to a special-purpose RAID-6 code at all array sizes straightforwardly. Moreover, it allows the vertical codes to lay out the data on the disks flexibly. We discuss this issue in the next subsection.

B. Data Layout

A distinctive difference between a horizontal RAID-6 array and a vertical RAID-6 array is the data layout on each component disk. In a horizontal RAID-6 array, the data and parity blocks are placed sequentially on the data and parity disks, respectively, while in a vertical RAID-6 array, the data and parity blocks are placed on each disk in an interleaved manner. Thus, under a sequential-access-dominated workload, a vertical RAID-6 array may perform worse than a horizontal one due to the overhead of frequent head seeks. For a vertical RAID-6 array implemented in single-stripe mode, a read from the disk into the memory cache always contains both data and parity. This may degrade read performance, since the parity data is useless when serving a read request. In a multiple-stripe implementation, on the other hand, the data blocks and the parity blocks can be accessed independently. This feature allows us to flexibly change the data layout on the disks to adapt to the access patterns of the workload. The capacity of a disk array is divided into many chunks, where a chunk is composed of several contiguous parity groups across the component disks. We can further divide a chunk into two regions, the data region and the parity region. The data blocks of all the parity groups in the chunk are placed sequentially in the data region, while the parity blocks are placed in the parity region.
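The difference between the two layouts can be sketched by listing the block sequence that one disk sees within a chunk. The group shape here (G parity groups, each contributing D data blocks and P parity blocks to this disk) is an arbitrary illustration of ours, not the paper's exact geometry:

```python
# Illustrative sketch (assumed geometry): the on-disk block sequence of
# ONE disk of a vertical array, for a chunk of G parity groups, each
# contributing D data blocks and P parity blocks to this disk.

def interleaved_layout(G, D, P):
    """Data and parity blocks of each parity group placed together."""
    seq = []
    for g in range(G):
        seq += [f"d{g}"] * D + [f"p{g}"] * P
    return seq

def region_layout(G, D, P):
    """All data blocks of the chunk first (data region), then all
    parity blocks (parity region): sequential reads touch only data."""
    return [f"d{g}" for g in range(G) for _ in range(D)] + \
           [f"p{g}" for g in range(G) for _ in range(P)]

# e.g. G = 3 groups, 2 data + 1 parity block per group on this disk:
# interleaved: d0 d0 p0 d1 d1 p1 d2 d2 p2
# region:      d0 d0 d1 d1 d2 d2 p0 p1 p2
```

Under the region layout, a long sequential read walks through contiguous data blocks and skips the parity region entirely, which is the benefit examined experimentally in Section 6.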
By tuning the chunk size, we can change the data layout to adapt the array to sequential or random disk accesses. We will examine the impact of the data layout schemes in detail in Section 6.

C. Write Strategy

There are generally two strategies to serve a write request, namely Read-Modify-Write (RMW) and Reconstruction Write (RCW) [20]. Suppose a write request arrives for a data block in a parity group. Under the RMW policy, first, the old content of that data block and the corresponding parity blocks are read into the buffer; second, the corresponding parity blocks are recomputed in the buffer; and last, the new data and parity blocks are written to the disks. Under the RCW policy, on the other hand, all the other data blocks in the same parity group are read into the buffer, then all the parity blocks of the parity group are reconstructed in the buffer, and finally the new data block and all the new parity blocks are written to the disks. RMW works well in a small-random-access environment, where only a minor portion of the data blocks in a parity group is updated at a time. Its performance is directly related to the update complexity of the underlying RAID-6 code. RCW, on the other hand, is suitable for workloads dominated by long sequential accesses. Also, if the update complexity of the RAID-6 code is high, RCW would be the better choice. The computational complexity of the RAID-6 code affects the performance of RCW, since a parity group reconstructs all of its parity blocks whenever a write request arrives for it. In a practical implementation, we can combine the two strategies and dynamically and adaptively select one of them at run time to minimize the write penalty [21].

VI. EXPERIMENTAL EVALUATION

We have implemented the two representative RAID-6 codes, RDP and P-Code, in our practical RAID-6 systems. In this section, we conduct extensive experiments on them to compare their practical performance under different design strategies and access patterns.
A. Experimental Setup

We have implemented RDP and P-Code by embedding them into the Linux Software RAID (MD) as loadable modules. We carry out the experiments on our storage platform of server-class hardware with an Intel Xeon 3.0GHz processor and 1GB DDR memory. We use one HighPoint RocketRAID 2220 SATA card to house 8 Seagate SATA disks. The rotational speed of these disks is 7200 RPM, with a peak transfer rate of 78MB/s. An additional IDE disk is used to hold the operating system (Fedora Core 4 Linux) and other software (MD and mdadm).

B. Construction and Reconstruction Performance Comparison

We configure the RDP and P-Code arrays each with 8 disks, 2 of which are used as spare disks. As a reference, the open-source Reed-Solomon code included in the Linux Software RAID is also evaluated with the same configuration. In this section we examine their construction and reconstruction performance.

Figure 6. Construction speed of RS, RDP, and P-Code.

The disk array starts the construction process when it is created; this process is also known as RAID synchronization. The disk array starts the reconstruction process when disk failures occur inside the array. In our experiment, we use the set-faulty functionality of mdadm to disable two disks in each of the three disk arrays, and start their reconstruction threads. Since the construction and reconstruction processes exhibit the same pattern, we only present the measured construction speed of the three codes in Figure 6. From Figure 6 we can see that RDP and P-Code have comparable construction speeds, and both outperform RS. The reason lies in the fact that RDP and P-Code use only XOR operations and have the same computational complexity, while RS uses much more complicated finite-field operations. However, the gap is not significant, since our server-class CPU is powerful, making the parity computation less of an obvious bottleneck of the entire system.
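The computational contrast between the XOR-only codes and RS can be made concrete with a toy encoder. This is our own sketch, not the MD module's code; the Q coefficients follow the common RAID-6 convention of powers of g = 2 in GF(2^8):

```python
# Toy illustration of why XOR-only codes encode faster than Reed-Solomon:
# XOR parity is one pass of byte-wise XOR, while an RS-style parity
# multiplies each data byte in GF(2^8) first.

def gf256_mul(a, b, poly=0x11d):
    """Shift-and-reduce multiply in GF(2^8) modulo x^8+x^4+x^3+x^2+1,
    the polynomial commonly used by RAID-6 Reed-Solomon codes."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly
    return r

def xor_parity(strips):
    """P parity of an XOR-based code: byte-wise XOR across all strips."""
    out = bytearray(len(strips[0]))
    for strip in strips:
        for i, byte in enumerate(strip):
            out[i] ^= byte
    return bytes(out)

def rs_parity(strips):
    """RS-style Q parity: GF(2^8) multiply-accumulate, coefficient
    g^j = 2^j for strip j (the usual RAID-6 Q convention)."""
    out = bytearray(len(strips[0]))
    for j, strip in enumerate(strips):
        coeff = 1
        for _ in range(j):
            coeff = gf256_mul(coeff, 2)
        for i, byte in enumerate(strip):
            out[i] ^= gf256_mul(coeff, byte)
    return bytes(out)
```

XOR parity costs one XOR per data byte, while the RS parity adds a GF(2^8) multiplication per byte; that extra per-byte work is what makes RS encoding slower in Figure 6, though a fast CPU keeps the gap modest.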
C. Impact of Data Layout on P-Code

We have implemented P-Code in the multiple-stripe mode, which allows us to flexibly change the data layout on the disks, as discussed in Section 5.2. In this section, we examine the performance of P-Code under two different data layout designs, where one separates the data blocks and parity blocks into two different regions in a data chunk (denoted "Read-con") while the other does not (denoted "Read-int"). User IO requests with different access patterns are generated by IOmeter [19]. Figure 7 shows the read throughput of P-Code under different access patterns, from small random access (i.e., 0%seq) to large sequential access (i.e., 100%seq). When the workload is dominated by small random accesses, the two schemes perform almost equally poorly. As the workload becomes more sequential, both schemes perform better, but the scheme that concentrates the data blocks inside a chunk (Read-con in Figure 7) gradually outperforms the other, which verifies the superiority of the judicious data layout discussed in Section 5.2.

Figure 7. Read performance comparison for the two different data layout schemes.
Figure 8. Write performance comparison for the two different write strategies.
Figure 9. Read performance comparison of RDP and P-Code.
Figure 10. Write performance comparison of RDP and P-Code.

D. Impact of Write Strategy on P-Code

We have discussed in Section 5.3 that the two different write strategies, namely RCW and RMW, may have different impacts on the write performance of the P-Code array. We have separately implemented an RCW-only version and a version that combines RCW and RMW. Figure 8 shows the write performance of P-Code under the two write strategies. The hybrid strategy always chooses, between the two, the strategy with the smaller write penalty. Thus, as we can see from the figure, the hybrid strategy outperforms RCW under random workloads and performs comparably with RCW under sequential workloads.

E. IO Performance Comparison of RDP and P-Code

In this section, we compare the read and write throughput of RDP and P-Code under different access patterns. Each is configured in the multiple-stripe mode with the hybrid RCW+RMW write strategy. Figure 9 shows the read performance comparison of RDP and P-Code. Under sequential workloads, RDP performs better than P-Code, and the gap narrows and then reverses as the workload becomes less sequential and more random. This is because P-Code always has both data and parity blocks inside a chunk on each disk, which may slow down the transfer rate of large blocks. On the other hand, as Figure 10 shows, P-Code has generally better write performance than RDP, especially under workloads of small random accesses. The reason is straightforward: both codes use RCW under large sequential accesses and RMW under small random accesses, and since P-Code has lower update complexity than RDP, it requires fewer IO operations to update the parity blocks under the RMW strategy.
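The interaction between update complexity and write strategy can be captured with a simplified IO-count model. The model is our own illustration and ignores seek costs and caching; a parity group has k data blocks and m parity blocks, and a request rewrites w of the data blocks:

```python
# Simplified IO-count model (an assumption for illustration; real arrays
# also weigh seek patterns and cache state).

def rmw_ios(k, m, w):
    """Read-Modify-Write: read the old data and parity blocks, then
    write the new data and parity blocks."""
    return 2 * (w + m)

def rcw_ios(k, m, w):
    """Reconstruction Write: read the untouched data blocks, then write
    the new data and all recomputed parity blocks."""
    return (k - w) + (w + m)

def choose_strategy(k, m, w):
    """Hybrid policy: pick whichever strategy incurs fewer disk IOs."""
    return "RMW" if rmw_ios(k, m, w) < rcw_ios(k, m, w) else "RCW"

# A small random write (w = 1) on a 6+2 parity group favors RMW, while
# a near-full-stripe write (w = 5) favors RCW.
```

A code with lower update complexity touches fewer parity blocks per data update (a smaller effective m under RMW), which is consistent with P-Code needing fewer IO operations than RDP for small random writes.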
VII. CONCLUSION

This paper aims to give a comprehensive comparison between horizontal and vertical RAID-6 codes from both the theoretical and the practical perspectives. We proposed two efficient code shortening algorithms for vertical codes, both of which are capable of extending a vertical code to an arbitrary length. In the context of our code shortening algorithms for vertical codes, we compared the theoretical performance of the representative horizontal code RDP and vertical code P-Code at consecutive lengths, and demonstrated that P-Code can provide comparable, and sometimes even better, performance than RDP. We then discussed the design and implementation issues of RDP and P-Code in the context of practical implementations. We also implemented them on our storage platform and measured their performance under different design parameters in a real environment. Experimental results showed that the practical performance behavior is, in general, consistent with the theoretical performance analysis.

As a direction for future work, we plan to apply the erasure codes to solid state disks (SSD) and storage class memory (SCM). Since SSD and SCM have physical characteristics different from those of traditional disks, they may require different strategies to boost performance. On the other hand, fully exploiting the computational ability of modern multicore processors or GPUs would also be valuable future work for high-performance erasure-coded storage systems.

ACKNOWLEDGMENT

This work is supported by the National Basic Research 973 Program of China under Grant No. 2011CB302301; 863 Projects 2009AA01A401 and 2009AA01A402; NSFC No. , , ; the Changjiang innovative group of Education of China No. IRT0725; and US NSF Grants IIS , CCF , CNS .

REFERENCES

[1] Patterson D, Gibson G, Katz R. A Case for Redundant Arrays of Inexpensive Disks (RAID). In: Proceedings of the International Conference on Management of Data (SIGMOD'88), Chicago, IL, 1988.
[2] Schroeder B, Gibson G. Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? In: Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST'07), San Jose, CA, 2007.
[3] Pinheiro E, Weber W D, Barroso L A. Failure Trends in a Large Disk Drive Population. In: Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST'07), San Jose, CA, 2007.
[4] Plank J S. A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems. Software Practice and Experience, 1997, 27(9).
[5] Blaum M, Brady J, Bruck J. EVENODD: An Efficient Scheme for Tolerating Double Disk Failures in RAID Architectures. IEEE Transactions on Computers, 1995, 44(2).
[6] Plank J S, Luo J, Schuman C D, et al. A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries for Storage. In: Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST'09), San Francisco, CA, 2009.
[7] Plank J S, Simmerman S, Schuman C D. Jerasure: A Library in C/C++ Facilitating Erasure Coding for Storage Applications. Technical Report, Department of Electrical Engineering and Computer Science, University of Tennessee.
[8] Corbett P, English B, Goel A. Row-Diagonal Parity for Double Disk Failure Correction. In: Proceedings of the 3rd USENIX Conference on File and Storage Technologies (FAST'04), San Francisco, CA, 2004.
[9] Jin C, Jiang H, Feng D, et al. P-Code: A New RAID-6 Code with Optimal Properties. In: Proceedings of the 23rd ACM International Conference on Supercomputing (ICS'09), New York, NY, 2009.
[10] Reed I S, Solomon G. Polynomial Codes over Certain Finite Fields. Journal of the Society for Industrial and Applied Mathematics, 1960, 8(2).
[11] Plank J S, Xu L. Optimizing Cauchy Reed-Solomon Codes for Fault-Tolerant Network Storage Applications. In: Proceedings of the 5th IEEE International Symposium on Network Computing and Applications (NCA'06), Cambridge, MA, 2006.
[12] Plank J S. The RAID-6 Liberation Codes. In: Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST'08), San Jose, CA, 2008.
[13] Xu L, Bruck J. X-Code: MDS Array Codes with Optimal Encoding. IEEE Transactions on Information Theory, 1999, 45(1).
[14] Xu L, Bohossian V, Bruck J, et al. Low-Density MDS Codes and Factors of Complete Graphs. IEEE Transactions on Information Theory, 1999, 45(6).
[15] Wagner D. On the Perfect One-Factorization Conjecture. Discrete Mathematics, 1992, 104(2).
[16] Jin C, Feng D, Liu J. Extending and Analysis of X-Code. Journal of Shanghai University (English Edition), 2011, 15(3).
[17] Bohossian V, Bruck J. Shortening Array Codes and the Perfect 1-Factorization Conjecture. In: Proceedings of the IEEE International Symposium on Information Theory (ISIT'06), Seattle, WA, 2006.
[18] Blaum M, Roth R M. On Lowest Density MDS Codes. IEEE Transactions on Information Theory, 1999, 45(1).
[19] IOmeter.
[20] Jin C, Feng D, Jiang H, et al. TRIP: Temporal Redundancy Integrated Performance Booster for Parity-Based RAID Storage Systems. In: Proceedings of the 16th International Conference on Parallel and Distributed Systems (ICPADS'10), Shanghai, China, 2010.
[21] Jin C, Feng D, Jiang H, et al. RAID6L: A Log-Assisted RAID6 Storage Architecture with Improved Write Performance. In: Proceedings of the 27th IEEE Symposium on Massive Storage Systems and Technologies (MSST'11), Denver, CO, 2011.


A RANDOMLY EXPANDABLE METHOD FOR DATA LAYOUT OF RAID STORAGE SYSTEMS. Received October 2017; revised February 2018 International Journal of Innovative Computing, Information and Control ICIC International c 2018 ISSN 1349-4198 Volume 14, Number 3, June 2018 pp. 1079 1094 A RANDOMLY EXPANDABLE METHOD FOR DATA LAYOUT

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

6. Results. This section describes the performance that was achieved using the RAMA file system.

6. Results. This section describes the performance that was achieved using the RAMA file system. 6. Results This section describes the performance that was achieved using the RAMA file system. The resulting numbers represent actual file data bytes transferred to/from server disks per second, excluding

More information

Application of the Computer Capacity to the Analysis of Processors Evolution. BORIS RYABKO 1 and ANTON RAKITSKIY 2 April 17, 2018

Application of the Computer Capacity to the Analysis of Processors Evolution. BORIS RYABKO 1 and ANTON RAKITSKIY 2 April 17, 2018 Application of the Computer Capacity to the Analysis of Processors Evolution BORIS RYABKO 1 and ANTON RAKITSKIY 2 April 17, 2018 arxiv:1705.07730v1 [cs.pf] 14 May 2017 Abstract The notion of computer capacity

More information

WHITE PAPER SINGLE & MULTI CORE PERFORMANCE OF AN ERASURE CODING WORKLOAD ON AMD EPYC

WHITE PAPER SINGLE & MULTI CORE PERFORMANCE OF AN ERASURE CODING WORKLOAD ON AMD EPYC WHITE PAPER SINGLE & MULTI CORE PERFORMANCE OF AN ERASURE CODING WORKLOAD ON AMD EPYC INTRODUCTION With the EPYC processor line, AMD is expected to take a strong position in the server market including

More information

International Journal of Innovations in Engineering and Technology (IJIET)

International Journal of Innovations in Engineering and Technology (IJIET) RTL Design and Implementation of Erasure Code for RAID system Chethan.K 1, Dr.Srividya.P 2, Mr.Sivashanmugam Krishnan 3 1 PG Student, Department Of ECE, R. V. College Engineering, Bangalore, India. 2 Associate

More information

Code 5-6: An Efficient MDS Array Coding Scheme to Accelerate Online RAID Level Migration

Code 5-6: An Efficient MDS Array Coding Scheme to Accelerate Online RAID Level Migration 2015 44th International Conference on Parallel Processing Code 5-6: An Efficient MDS Array Coding Scheme to Accelerate Online RAID Level Migration Chentao Wu 1, Xubin He 2, Jie Li 1 and Minyi Guo 1 1 Shanghai

More information

Deduction and Logic Implementation of the Fractal Scan Algorithm

Deduction and Logic Implementation of the Fractal Scan Algorithm Deduction and Logic Implementation of the Fractal Scan Algorithm Zhangjin Chen, Feng Ran, Zheming Jin Microelectronic R&D center, Shanghai University Shanghai, China and Meihua Xu School of Mechatronical

More information

PC-based data acquisition II

PC-based data acquisition II FYS3240 PC-based instrumentation and microcontrollers PC-based data acquisition II Data streaming to a storage device Spring 2015 Lecture 9 Bekkeng, 29.1.2015 Data streaming Data written to or read from

More information

Close-form and Matrix En/Decoding Evaluation on Different Erasure Codes

Close-form and Matrix En/Decoding Evaluation on Different Erasure Codes UNIVERSITY OF MINNESOTA-TWIN CITIES Close-form and Matrix En/Decoding Evaluation on Different Erasure Codes by Zhe Zhang A thesis submitted in partial fulfillment for the degree of Master of Science in

More information

CSE 153 Design of Operating Systems

CSE 153 Design of Operating Systems CSE 153 Design of Operating Systems Winter 2018 Lecture 22: File system optimizations and advanced topics There s more to filesystems J Standard Performance improvement techniques Alternative important

More information

Erasure coding and AONT algorithm selection for Secure Distributed Storage. Alem Abreha Sowmya Shetty

Erasure coding and AONT algorithm selection for Secure Distributed Storage. Alem Abreha Sowmya Shetty Erasure coding and AONT algorithm selection for Secure Distributed Storage Alem Abreha Sowmya Shetty Secure Distributed Storage AONT(All-Or-Nothing Transform) unkeyed transformation φ mapping a sequence

More information

Parallelizing Inline Data Reduction Operations for Primary Storage Systems

Parallelizing Inline Data Reduction Operations for Primary Storage Systems Parallelizing Inline Data Reduction Operations for Primary Storage Systems Jeonghyeon Ma ( ) and Chanik Park Department of Computer Science and Engineering, POSTECH, Pohang, South Korea {doitnow0415,cipark}@postech.ac.kr

More information

Delayed Partial Parity Scheme for Reliable and High-Performance Flash Memory SSD

Delayed Partial Parity Scheme for Reliable and High-Performance Flash Memory SSD Delayed Partial Parity Scheme for Reliable and High-Performance Flash Memory SSD Soojun Im School of ICE Sungkyunkwan University Suwon, Korea Email: lang33@skku.edu Dongkun Shin School of ICE Sungkyunkwan

More information

A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries For Storage

A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries For Storage A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries For Storage James S. Plank Jianqiang Luo Catherine D. Schuman Lihao Xu Zooko Wilcox-O Hearn Appearing in: FAST-9: 7th USENIX

More information

Seek-Efficient I/O Optimization in Single Failure Recovery for XOR-Coded Storage Systems

Seek-Efficient I/O Optimization in Single Failure Recovery for XOR-Coded Storage Systems IEEE th Symposium on Reliable Distributed Systems Seek-Efficient I/O Optimization in Single Failure Recovery for XOR-Coded Storage Systems Zhirong Shen, Jiwu Shu, Yingxun Fu Department of Computer Science

More information

Performance Consistency

Performance Consistency White Paper Performance Consistency SanDIsk Corporation Corporate Headquarters 951 SanDisk Drive, Milpitas, CA 95035, U.S.A. Phone +1.408.801.1000 Fax +1.408.801.8657 www.sandisk.com Performance Consistency

More information

A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries For Storage

A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries For Storage A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries For Storage James S. Plank Jianqiang Luo Catherine D. Schuman Lihao Xu Zooko Wilcox-O Hearn Abstract Over the past five

More information

ECE Enterprise Storage Architecture. Fall 2018

ECE Enterprise Storage Architecture. Fall 2018 ECE590-03 Enterprise Storage Architecture Fall 2018 RAID Tyler Bletsch Duke University Slides include material from Vince Freeh (NCSU) A case for redundant arrays of inexpensive disks Circa late 80s..

More information

SCALING UP OF E-MSR CODES BASED DISTRIBUTED STORAGE SYSTEMS WITH FIXED NUMBER OF REDUNDANCY NODES

SCALING UP OF E-MSR CODES BASED DISTRIBUTED STORAGE SYSTEMS WITH FIXED NUMBER OF REDUNDANCY NODES SCALING UP OF E-MSR CODES BASED DISTRIBUTED STORAGE SYSTEMS WITH FIXED NUMBER OF REDUNDANCY NODES Haotian Zhao, Yinlong Xu and Liping Xiang School of Computer Science and Technology, University of Science

More information

Reducing The De-linearization of Data Placement to Improve Deduplication Performance

Reducing The De-linearization of Data Placement to Improve Deduplication Performance Reducing The De-linearization of Data Placement to Improve Deduplication Performance Yujuan Tan 1, Zhichao Yan 2, Dan Feng 2, E. H.-M. Sha 1,3 1 School of Computer Science & Technology, Chongqing University

More information

Presented by: Nafiseh Mahmoudi Spring 2017

Presented by: Nafiseh Mahmoudi Spring 2017 Presented by: Nafiseh Mahmoudi Spring 2017 Authors: Publication: Type: ACM Transactions on Storage (TOS), 2016 Research Paper 2 High speed data processing demands high storage I/O performance. Flash memory

More information

Computer Science 146. Computer Architecture

Computer Science 146. Computer Architecture Computer Science 46 Computer Architecture Spring 24 Harvard University Instructor: Prof dbrooks@eecsharvardedu Lecture 22: More I/O Computer Science 46 Lecture Outline HW5 and Project Questions? Storage

More information

STORAGE CONFIGURATION GUIDE: CHOOSING THE RIGHT ARCHITECTURE FOR THE APPLICATION AND ENVIRONMENT

STORAGE CONFIGURATION GUIDE: CHOOSING THE RIGHT ARCHITECTURE FOR THE APPLICATION AND ENVIRONMENT WHITEPAPER STORAGE CONFIGURATION GUIDE: CHOOSING THE RIGHT ARCHITECTURE FOR THE APPLICATION AND ENVIRONMENT This document is designed to aid in the configuration and deployment of Nexsan storage solutions

More information

Encoding-Aware Data Placement for Efficient Degraded Reads in XOR-Coded Storage Systems

Encoding-Aware Data Placement for Efficient Degraded Reads in XOR-Coded Storage Systems Encoding-Aware Data Placement for Efficient Degraded Reads in XOR-Coded Storage Systems Zhirong Shen, Patrick P. C. Lee, Jiwu Shu, Wenzhong Guo College of Mathematics and Computer Science, Fuzhou University

More information

DiskReduce: RAID for Data-Intensive Scalable Computing (CMU-PDL )

DiskReduce: RAID for Data-Intensive Scalable Computing (CMU-PDL ) Research Showcase @ CMU Parallel Data Laboratory Research Centers and Institutes 11-2009 DiskReduce: RAID for Data-Intensive Scalable Computing (CMU-PDL-09-112) Bin Fan Wittawat Tantisiriroj Lin Xiao Garth

More information

Database Architecture 2 & Storage. Instructor: Matei Zaharia cs245.stanford.edu

Database Architecture 2 & Storage. Instructor: Matei Zaharia cs245.stanford.edu Database Architecture 2 & Storage Instructor: Matei Zaharia cs245.stanford.edu Summary from Last Time System R mostly matched the architecture of a modern RDBMS» SQL» Many storage & access methods» Cost-based

More information

RAID: The Innovative Data Storage Manager

RAID: The Innovative Data Storage Manager RAID: The Innovative Data Storage Manager Amit Tyagi IIMT College of Engineering, Greater Noida, UP, India Abstract-RAID is a technology that is used to increase the performance and/or reliability of data

More information

hot plug RAID memory technology for fault tolerance and scalability

hot plug RAID memory technology for fault tolerance and scalability hp industry standard servers april 2003 technology brief TC030412TB hot plug RAID memory technology for fault tolerance and scalability table of contents abstract... 2 introduction... 2 memory reliability...

More information

Appendix D: Storage Systems

Appendix D: Storage Systems Appendix D: Storage Systems Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Storage Systems : Disks Used for long term storage of files temporarily store parts of pgm

More information

VERITAS Storage Foundation 4.0 for Oracle

VERITAS Storage Foundation 4.0 for Oracle J U N E 2 0 0 4 VERITAS Storage Foundation 4.0 for Oracle Performance Brief OLTP Solaris Oracle 9iR2 VERITAS Storage Foundation for Oracle Abstract This document details the high performance characteristics

More information

Stupid File Systems Are Better

Stupid File Systems Are Better Stupid File Systems Are Better Lex Stein Harvard University Abstract File systems were originally designed for hosts with only one disk. Over the past 2 years, a number of increasingly complicated changes

More information

Storage. Hwansoo Han

Storage. Hwansoo Han Storage Hwansoo Han I/O Devices I/O devices can be characterized by Behavior: input, out, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections 2 I/O System Characteristics

More information

Pyramid Codes: Flexible Schemes to Trade Space for Access Efficiency in Reliable Data Storage Systems

Pyramid Codes: Flexible Schemes to Trade Space for Access Efficiency in Reliable Data Storage Systems Pyramid Codes: Flexible Schemes to Trade Space for Access Efficiency in Reliable Data Storage Systems Cheng Huang, Minghua Chen, and Jin Li Microsoft Research, Redmond, WA 98052 Abstract To flexibly explore

More information

High Performance Computing Course Notes High Performance Storage

High Performance Computing Course Notes High Performance Storage High Performance Computing Course Notes 2008-2009 2009 High Performance Storage Storage devices Primary storage: register (1 CPU cycle, a few ns) Cache (10-200 cycles, 0.02-0.5us) Main memory Local main

More information

PANASAS TIERED PARITY ARCHITECTURE

PANASAS TIERED PARITY ARCHITECTURE PANASAS TIERED PARITY ARCHITECTURE Larry Jones, Matt Reid, Marc Unangst, Garth Gibson, and Brent Welch White Paper May 2010 Abstract Disk drives are approximately 250 times denser today than a decade ago.

More information

OS and Hardware Tuning

OS and Hardware Tuning OS and Hardware Tuning Tuning Considerations OS Threads Thread Switching Priorities Virtual Memory DB buffer size File System Disk layout and access Hardware Storage subsystem Configuring the disk array

More information

Efficient Load Balancing and Disk Failure Avoidance Approach Using Restful Web Services

Efficient Load Balancing and Disk Failure Avoidance Approach Using Restful Web Services Efficient Load Balancing and Disk Failure Avoidance Approach Using Restful Web Services Neha Shiraz, Dr. Parikshit N. Mahalle Persuing M.E, Department of Computer Engineering, Smt. Kashibai Navale College

More information

File systems CS 241. May 2, University of Illinois

File systems CS 241. May 2, University of Illinois File systems CS 241 May 2, 2014 University of Illinois 1 Announcements Finals approaching, know your times and conflicts Ours: Friday May 16, 8-11 am Inform us by Wed May 7 if you have to take a conflict

More information

OS and HW Tuning Considerations!

OS and HW Tuning Considerations! Administração e Optimização de Bases de Dados 2012/2013 Hardware and OS Tuning Bruno Martins DEI@Técnico e DMIR@INESC-ID OS and HW Tuning Considerations OS " Threads Thread Switching Priorities " Virtual

More information

Database Systems II. Secondary Storage

Database Systems II. Secondary Storage Database Systems II Secondary Storage CMPT 454, Simon Fraser University, Fall 2009, Martin Ester 29 The Memory Hierarchy Swapping, Main-memory DBMS s Tertiary Storage: Tape, Network Backup 3,200 MB/s (DDR-SDRAM

More information

In the late 1980s, rapid adoption of computers

In the late 1980s, rapid adoption of computers hapter 3 ata Protection: RI In the late 1980s, rapid adoption of computers for business processes stimulated the KY ONPTS Hardware and Software RI growth of new applications and databases, significantly

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

Lecture 23: Storage Systems. Topics: disk access, bus design, evaluation metrics, RAID (Sections )

Lecture 23: Storage Systems. Topics: disk access, bus design, evaluation metrics, RAID (Sections ) Lecture 23: Storage Systems Topics: disk access, bus design, evaluation metrics, RAID (Sections 7.1-7.9) 1 Role of I/O Activities external to the CPU are typically orders of magnitude slower Example: while

More information

Differential RAID: Rethinking RAID for SSD Reliability

Differential RAID: Rethinking RAID for SSD Reliability Differential RAID: Rethinking RAID for SSD Reliability Mahesh Balakrishnan Asim Kadav 1, Vijayan Prabhakaran, Dahlia Malkhi Microsoft Research Silicon Valley 1 The University of Wisconsin-Madison Solid

More information

On Data Parallelism of Erasure Coding in Distributed Storage Systems

On Data Parallelism of Erasure Coding in Distributed Storage Systems On Data Parallelism of Erasure Coding in Distributed Storage Systems Jun Li, Baochun Li Department of Electrical and Computer Engineering, University of Toronto, Canada {junli, bli}@ece.toronto.edu Abstract

More information

1 of 6 4/8/2011 4:08 PM Electronic Hardware Information, Guides and Tools search newsletter subscribe Home Utilities Downloads Links Info Ads by Google Raid Hard Drives Raid Raid Data Recovery SSD in Raid

More information