LaRS: A Load-aware Recovery Scheme for Heterogeneous Erasure-Coded Storage Clusters


2014 9th IEEE International Conference on Networking, Architecture, and Storage

LaRS: A Load-aware Recovery Scheme for Heterogeneous Erasure-Coded Storage Clusters

Haibing Luo, Jianzhong Huang, Qiang Cao and Changsheng Xie
Wuhan National Lab. for Optoelectronics, Huazhong University of Science and Technology, Wuhan, China
{luohaibinghust@gmail.com, Corresponding Author: hjzh@hust.edu.cn}

Abstract: To reduce the probability of data unavailability, it is extremely important to quickly recover failed data in a (k+r,k) erasure-coded storage cluster. In practice, storage nodes in a large-scale storage system have various network bandwidths and I/O capabilities; therefore, the heterogeneity of a storage system increases along with its growing scale. Both the traditional recovery scheme and the Fastest recovery scheme simply retrieve k surviving blocks from k surviving nodes, thereby resulting in low recovery performance in a heterogeneous storage cluster. In this paper, we propose a Load-aware Recovery Scheme (LaRS) for heterogeneous RS-coded storage clusters. LaRS not only takes into account both the heterogeneity and the load of nodes, but also enables all surviving nodes to service reconstruction reads. The number of surviving blocks retrieved from a surviving node depends on its load weight, which is determined by both network bandwidth and I/O capacity: more blocks are fetched from faster nodes, and vice versa. The three recovery schemes are implemented on a 9-node heterogeneous RS-coded storage cluster, where a set of comparative experiments is conducted. The experimental results show that our LaRS scheme outperforms the other two schemes by a factor of up to 1.58.

Keywords: Heterogeneous storage; erasure codes; recovery scheme

I. INTRODUCTION

Storage clusters provide a scalable platform for storing massive data over multiple storage nodes; for example, they have been deployed in cloud storage (e.g., GFS II [1] and Microsoft Azure Storage (WAS) [2]). Since node failures are common, data redundancy mechanisms (e.g., replication [3][4][5] or erasure coding [6][7][2]) are used to ensure data availability. Because of their higher fault tolerance at the same storage overhead, (k+r,k) Reed-Solomon (RS) codes are used by real-world storage clusters (e.g., GFS II and WAS). To maintain overall system reliability, it is important to rapidly recover a failed node, i.e., to reconstruct the lost data of the failed node by retrieving the required blocks from surviving nodes.

Over the last decade, the scale of storage systems has grown rapidly due to the data explosion. For example, the number of Microsoft's servers had reached one million by 2013 according to Microsoft CEO Steve Ballmer, and one of the data centers built by Google has more than 45 thousand servers. Thus, it is very difficult to maintain the homogeneity of servers at such a large scale. In particular, these servers may come from different vendors, and storage nodes from the same vendor may have different capabilities after system upgrades. Heterogeneity of storage nodes has an adverse effect on node recovery within the traditional recovery solution (TRS), where only k surviving nodes are involved in the node recovery. Although some recovery solutions have been proposed for erasure-coded storage systems [8][9][10], they usually assume that all disks or nodes are homogeneous. In practice, both the transmission bandwidth and the I/O capability of nodes are usually different.
The study of efficient recovery from node failures in a heterogeneous RS-coded storage system is still in its infancy. We propose a Load-aware Recovery Scheme (LaRS) for heterogeneous erasure-coded storage systems. The key design idea of LaRS is to reduce the reconstruction time based on the load of all surviving nodes when one or more failures occur in a heterogeneous distributed storage system. Specifically, fewer (or more) blocks are retrieved from the slower (or faster) surviving storage nodes, thus minimizing the negative impact of slow storage nodes on the total recovery time.

Our main contribution is twofold. First, we propose an efficient recovery strategy that retrieves surviving blocks from all surviving nodes, where the load of a surviving node determines the number of surviving blocks retrieved from it. Second, we analyze the LaRS scheme in a quantitative manner and validate its superiority by prototyping it in a heterogeneous erasure-coded storage cluster.

The remainder of this paper is organized as follows. Section II presents the related work. Section III details the concept and design of LaRS. Section IV evaluates LaRS experimentally, and Section V provides further discussion. Finally, we conclude our work in Section VI.

II. RELATED WORK

A. Erasure-Coded Storage Clusters

Besides replication, erasure coding is another redundancy mechanism for storage systems. In a (k+r,k) RS-coded storage cluster (see Fig. 1), the k original data blocks and the r parity blocks generated via the erasure code are exclusively stored on the storage nodes. Any k blocks in a stripe suffice to recover the k original blocks. As a result, data stored on the RS-coded storage cluster are protected against simultaneous failures of up to r nodes. RS codes are Maximum Distance Separable (MDS), which means that even if any r of the k+r nodes fail, the

failed data can be recomputed from the k surviving disks [11]. We investigate our recovery scheme for clustered storage based on RS codes, which are also used in many in-production storage systems, including GFS II [1] and Microsoft Azure Storage [12].

Figure 1. Architecture and data layout of a (k+r, k) RS-coded storage cluster (client, switch, manager, k data nodes DN_1..DN_k storing data blocks, and r parity nodes PN_1..PN_r storing parity blocks).

Commonly, (k+r, k) RS coding means that there are k+r encoded blocks in a stripe, including k information blocks (namely data blocks) and r redundancy blocks (namely parity blocks). RS codes use a generator matrix and Galois Field arithmetic in both encoding and decoding. The generator matrix consists of a k×k identity matrix and a k×r redundancy matrix. A potentially limitless sequence of encoded blocks can be generated from the data blocks. Due to Galois Field multiplications, RS codes are much more expensive in computation than parity array codes (e.g., RDP [13], EVENODD [14], etc.), which only involve XOR calculations. As mentioned in [15], it is not CPU computation but network latency and storage bandwidth that play the dominant role in access performance for networked RS-coded storage systems.

B. Recovery Schemes

Most existing recovery schemes are designed for RAID systems. They fall mainly into two categories. (1) Schemes that assume the disks are homogeneous. For example, Xiang et al. [9] and Wang et al. [8] present optimal single-node failure recovery solutions for RDP and EVENODD; both schemes utilize the redundancy property of XOR-based erasure codes to optimize the recovery, without considering the heterogeneity and load of disks. Although Wu et al. [16] design a new RAID-6 code that evenly distributes read/write requests among surviving disks, the key premise is that all surviving disks have the same I/O capability. (2) Schemes that take the load of disks into account. For instance, Tian et al. [17] design a popularity-based reconstruction optimization scheme (PRO) for RAID-structured storage systems. The PRO scheme reorders recovery I/Os by first rebuilding frequently read data, thus minimizing the amount of read requests incurred by the rebuilding process. Zhu et al. propose a cost-based heterogeneous recovery (CHR) scheme [18] and a replace recovery algorithm [19] for distributed storage systems based on RAID-6 codes. Both recovery solutions mainly aim at minimizing the total recovery cost of a single-node failure by exploiting the property that there are multiple recovery chains for a single-node recovery in double-fault-tolerant RAID-6 codes. Different from the CHR scheme, which addresses node heterogeneity by periodically choosing a suitable recovery path, our LaRS scheme maximizes the utilization of all surviving nodes by retrieving more surviving blocks from the faster surviving nodes.

In this work, we present a load-aware recovery scheme that schedules the recovery read requests according to each surviving node's weight, which varies with both the heterogeneity and the load of that node. To our knowledge, this is the first work that explicitly seeks to optimize failure recovery for heterogeneous distributed storage systems with Reed-Solomon codes.

III. DESIGN OF LARS

In this section, we present the main idea and design of LaRS, and discuss why it is suitable for heterogeneous erasure-coded storage clusters.
To simplify our discussion, we focus on the single-node failure recovery problem, but LaRS is also applicable to the multi-node failure recovery problem. We consider an (n=k+r, k) RS-coded storage cluster. As shown in Fig. 1, the storage system has n storage nodes and a manager, which are connected by a local area network (LAN). Encoded data is striped across the storage nodes to enable high I/O throughput. In a stripe, the k data blocks and the r parity blocks are exclusively stored on the k data nodes (DNs) and the r parity nodes (PNs).

Now let data node DN_i be the failed node, where 1 ≤ i ≤ k. Our goal is to recover all lost data of DN_i. According to RS codes, to recover a data block on a storage node, we should read k blocks that belong to the same stripe as the lost block from any k surviving nodes. However, some problems arise in a heterogeneous environment. First, surviving nodes with low transmission bandwidth and I/O capacity may delay data transmission, thus prolonging the total reconstruction time. Second, it wastes resources, since some surviving nodes are not involved in the reconstruction process.

The basic idea of LaRS is to retrieve the required surviving blocks from all surviving nodes, and to retrieve fewer (or more) blocks from the slower (or faster) surviving nodes to reconstruct the failed data. Unlike the traditional reconstruction scheme, LaRS retrieves surviving blocks from multiple stripes and recovers multiple failed blocks of the failed node at a time. When a storage node fails, the manager first obtains the

I/O response time of each surviving node, which is used to calculate the weight of each surviving node; it then determines the number and location of the surviving blocks provided by each surviving node according to its weight. The rebuilding node retrieves the associated surviving blocks based on the result given by the manager and then reconstructs the lost data. The above procedure is repeated until all data of the failed node is fully recovered.

As analyzed above, the LaRS scheme has two advantages: (1) it maximizes the utilization of all surviving nodes; (2) it balances the transmission time of the nodes so as to reduce unnecessary waiting time. Although the surviving blocks retrieved from the disk of a surviving node may not be contiguous, since they belong to different stripes, the disk access overhead does not become the performance bottleneck of node reconstruction; the reason is that the surviving-block-receiving step at the rebuilding node dominates the total reconstruction time.

Therefore, the LaRS design should address the following two aspects: (1) an effective algorithm that determines the bitmap BM[N_d][N] according to the weight set W[1..N]; (2) a method to update the weight of each surviving node. Table I lists the symbols used later.

Table I. Symbols and Annotations.
k, r: Number of data blocks and parity blocks in a stripe
N: Number of surviving nodes in the storage cluster
SN_i: The i-th surviving storage node, 1 ≤ i ≤ N
W_i: The weight of node SN_i
W[1..N]: The set of weights; W[1..N] = {W_1, W_2, ..., W_N}
T_A,i: Average response time of SN_i within a sliding window
L_i: The load status of SN_i
L_upper: The upper bound of L_i for updating weight W_i of SN_i
L_lower: The lower bound of L_i for updating weight W_i of SN_i
N_blk,i: Number of blocks fetched from SN_i to the rebuilding node
T_R: The total recovery time
T_U: The weight-updating interval
T_r,i: Time spent to retrieve a block from SN_i
W[1..k]: The set of the k largest values in weight set W[1..N]
BM[N_d][N]: Bitmap of surviving blocks used to reconstruct failed blocks

A. Effectiveness Explanation

The total recovery time T_R is of great importance for a storage system: reducing T_R improves the reliability and availability of the storage system, so our goal is to reduce T_R. The recovery process includes three steps: retrieving surviving blocks from surviving nodes, calculating the failed data blocks, and writing the recovered blocks. Since the three steps can be carried out in a pipelined manner, the total recovery time T_R is restricted by the slowest step. In a heterogeneous cluster, storage nodes have various I/O capacities and transmission bandwidths, and some of them may have very low I/O capacity and transmission bandwidth; by comparison, computing is usually not the performance bottleneck. Given all of this, the step of retrieving surviving blocks is usually the performance bottleneck of the recovery process, and thus we obtain Eq. (1):

T_R = (A_D / a_D) × Max_{i=1..N} {N_blk,i × T_r,i}    (1)

Here A_D is the total amount of data that should be recovered in the failed node, and a_D is the amount of data that the rebuilding node recovers in one recovery round according to bitmap BM[N_d][N]. T_r,i is the time spent to retrieve a block from SN_i, and Max_{i=1..N} {N_blk,i × T_r,i} is the time spent to retrieve all the blocks required to recover the a_D data in one round. Here, we focus on reducing the value of Max_{i=1..N} {N_blk,i × T_r,i} so as to speed up the recovery process.
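To make Eq. (1) concrete, here is a minimal Python sketch of the round-by-round cost model; the per-block retrieval times and the block assignment below are hypothetical illustration values, not the measured numbers from our testbed.

    # Illustrative sketch of Eq. (1): T_R = (A_D / a_D) * max_i { N_blk,i * T_r,i }.
    # All numbers below are hypothetical; they only illustrate the cost model.

    def total_recovery_time(a_total, a_round, n_blk, t_r):
        """a_total: total amount of data to recover (A_D);
        a_round: amount recovered per round (a_D);
        n_blk[i]: blocks fetched from surviving node i in one round (N_blk,i);
        t_r[i]: time to retrieve one block from node i (T_r,i)."""
        # Each round is bounded by the slowest node, i.e. max_i { N_blk,i * T_r,i }.
        per_round = max(n * t for n, t in zip(n_blk, t_r))
        return (a_total / a_round) * per_round

    # Example: each round recovers 6 failed blocks using 8 surviving nodes.
    t_r = [16, 16, 16, 16, 20, 33, 33, 47]   # ms per block (hypothetical)
    n_blk = [6, 6, 6, 6, 5, 3, 3, 1]         # load-aware assignment for one round
    print(total_recovery_time(a_total=600, a_round=6, n_blk=n_blk, t_r=t_r))  # in ms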
Two steps are involved in recovering the a_D data in a recovery round: (1) retrieving the needed data, and (2) calculating the failed blocks. The former step retrieves the required blocks from all surviving nodes. The time spent to download blocks from a surviving node consists of two parts: T_latency and T_transfer. The surviving node spends time T_latency reading local blocks, which are then transferred to the rebuilding node within time T_transfer. The transmission bandwidth of surviving nodes may fluctuate widely in a heterogeneous cluster, so T_transfer dominates the access latency for slow nodes. Meanwhile, a number of surviving nodes may have low-speed I/O systems or outdated disks, and the I/O latency T_latency of these nodes may be several times that of others. Furthermore, the transfer time T_transfer changes quickly along with load fluctuations, and the T_transfer of one surviving node may be an order of magnitude larger than that of another.

Time T_r,i is thus affected by two factors: network transfer and disk access. If the available network bandwidth of the i-th surviving node is low, the rebuilding node will receive its surviving blocks slowly. If the disk of the surviving node suffers heavy I/O requests (i.e., is overloaded), its disk access time will increase. It is apparent that the total time of a recovery round is determined by the slowest storage node. Therefore, if we can reduce the retrieval time contributed by the slower nodes, we can reduce T_R. By virtue of LaRS, we retrieve fewer (or more) blocks from the surviving storage nodes that are slower (or faster). Therefore, all surviving nodes (slow or fast) spend a similar amount of time transmitting their assigned blocks, which reduces the unnecessary waiting time of the rebuilding node; thereby, T_R can be minimized.

B. LaRS

The manager periodically monitors the condition of the storage cluster. When a storage node fails, the manager records the

completion time of retrieving blocks from each surviving node, and calculates the weights of all surviving nodes from the recorded completion times. The weight W_i of the i-th surviving node is calculated according to Eq. (2):

W_i = 1 / T_r,i    (2)

T_r,i is the average completion time of downloading a block from SN_i. The weight represents the data-transmission ability of SN_i: a larger W_i indicates that node SN_i has higher transmission performance. To conveniently compute an integer value for weight W_i, the time T_r,i is scaled by a constant factor.

With the weight set W[1..N] we can calculate the number and location of the blocks that the rebuilding node should read from the surviving nodes. We let BM be an N_d × N two-dimensional matrix that identifies the number and location of the blocks retrieved from surviving nodes, such that the element BM[i][j] (for 0 ≤ i ≤ N_d-1, 0 ≤ j ≤ N-1) is initialized to 0, and is set to 1 if the corresponding block is to be retrieved. We then compute the result from W[1..N] as follows: we first check the number of elements greater than 0 in set W[1..N]; if this number is greater than or equal to k, we select the k largest elements from W[1..N], decrease each of them by 1, and set the corresponding elements of array BM[N_d][N] to 1; otherwise, the iteration terminates. The number of rows of BM is

N_d = (Σ_{i=1}^{N} W_i) / k    (3)

A column of array BM denotes a storage node, and a '1' in the column represents that a block in the corresponding node should be retrieved to the rebuilding node. A row of BM represents a stripe, and all the blocks tagged with '1' in a row are used to reconstruct one failed block. In this way, the rebuilding node can retrieve surviving blocks according to the array BM[N_d][N] to recover the lost data.

C. Weight Updating

From the analysis in the previous section we know that the weights of the storage nodes are very important for the recovery scheme. Considering that the load status of the storage nodes is not constant during the recovery, the weight of each storage node should be dynamic. For instance, if user I/O requests to a storage node are frequent for a period of time, its load increases and the available network bandwidth of this node is reduced. With regard to this, it is desirable to have an algorithm that updates the weights of the storage nodes during the recovery.

T_A,i is the average response time of SN_i within a sliding window and is thus positively correlated with its load status, but by itself it is not enough to judge whether the weight W_i should be updated. So a parameter indicating the relative load status of a surviving node is required. Let T_A,a be the average response time of all the storage nodes within the sliding window. We use the ratio L_i between T_A,i and T_A,a to reflect the load status of the i-th surviving node SN_i:
L_i = T_A,i / T_A,a    (4)

Algorithm 1: Determining the Bitmap of Surviving Blocks
// |W[1..N] > 0| denotes the number of elements greater than 0 in the weight set W[1..N]
Input:
    k: number of data nodes in the storage cluster
    N: number of surviving nodes in the storage cluster
    W[1..N]: the weight set of all surviving storage nodes
Output:
    BM[N_d][N]: bitmap of surviving blocks used for recovery
BM[N_d][N] = {0}, j = 0            // initializing
while |W[1..N] > 0| ≥ k do
    Select the k largest elements from W[1..N] to form W[1..k];
    foreach W_i ∈ W[1..k] do
        W_i--;  BM[j][i] = 1;
    end
    j++;
end
return BM[N_d][N]

Algorithm 1 gives the pseudo-code to generate the array BM[N_d][N] according to the weight set W[1..N].

We set two parameters, L_upper and L_lower: L_upper is the upper bound of the load status and L_lower is the lower bound of the load status. Once the load status L_i of SN_i exceeds L_upper, this storage node is more heavily loaded than the other nodes and we should reduce W_i. If L_i is less than L_lower, this storage node has a lower workload and W_i should be increased; otherwise W_i remains unchanged. In order to keep the recovery process steady, W_i is increased or decreased by 1 at a time. Since the I/O loads change quickly in LAN-based storage clusters, we set the updating interval to 1 second.

Algorithm 2: Updating Weights of Storage Nodes
foreach SN_i ∈ {SN_i} do
    if L_i > L_upper then
        W_i--;
    else if L_i < L_lower then
        W_i++;
    end
end
return W[1..N]
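The following Python sketch is a straightforward transliteration of Algorithms 1 and 2 (our own illustration; variable names and the example weights are assumptions, not the paper's implementation):

    # Sketch of Algorithm 1: build the bitmap BM[N_d][N] from the weight set W[1..N].
    def build_bitmap(weights, k):
        """weights: integer weights W_i of the N surviving nodes; k: blocks needed
        per stripe to reconstruct one failed block. Returns a list of N_d rows."""
        w = list(weights)
        bm = []
        # Schedule stripes while at least k nodes still have weight left.
        while sum(1 for x in w if x > 0) >= k:
            row = [0] * len(w)
            # Pick the k nodes with the largest remaining weights for this stripe.
            for i in sorted(range(len(w)), key=lambda i: w[i], reverse=True)[:k]:
                w[i] -= 1
                row[i] = 1
            bm.append(row)
        return bm

    # Sketch of Algorithm 2: periodic weight update driven by the load ratio L_i.
    def update_weights(weights, load, l_upper=1.6, l_lower=0.5):
        """load[i] = T_A,i / T_A,a; overloaded nodes lose weight, lightly loaded gain."""
        return [w - 1 if l > l_upper else (w + 1 if l < l_lower else w)
                for w, l in zip(weights, load)]

    # Example: 8 surviving nodes, k = 6; sum of weights / k = 6 stripes per bitmap.
    bm = build_bitmap([6, 6, 6, 6, 5, 3, 3, 1], k=6)
    print(len(bm))  # -> 6 (= N_d)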

Algorithm 2 describes the procedure of updating the weights of the storage nodes. This procedure is triggered periodically to adjust the weights W_i, so that the LaRS scheme can maximize the utilization of the surviving nodes. Both parameters L_upper and L_lower are usually associated with the configuration of a specific storage cluster and should be determined through concrete experiments.

D. Example

We demonstrate LaRS using a concrete example, and illustrate how it improves the recovery performance over the traditional and Fastest recovery schemes. Fig. 2 shows a storage system in which 9 storage nodes are connected via a switch; each node is characterized by an average response time. The storage system uses RS codes and has 6 data nodes and 3 parity nodes. Each storage node stores N_b data blocks of size S_b. We assume one data node fails. We calculate W[1..N] according to Eq. (2); the weight of each surviving node is shown in Table II. The total recovery time can then be calculated using Eq. (1).

Figure 2. A 9-node heterogeneous storage cluster (failed node, rebuilding node, and surviving nodes SN_1 to SN_8 with their per-block response times, connected via a switch).

Table II. The storage nodes' weights (response time T_r,i in ms and the corresponding weight W_i for SN_1 through SN_8).

1) LaRS scheme: LaRS first calculates the array BM, which indicates the number and location of surviving blocks. In this example, after the rebuilding node reads 6 blocks from nodes {SN_1, SN_2, SN_3, SN_4}, 5 blocks from node SN_5, 2 blocks from nodes SN_7 and SN_8, and 3 blocks from node SN_6, it can reconstruct 6 failed blocks. This process continues until all blocks of the failed node are recovered. According to Eq. (1), the total recovery time T_R,LaRS is equal to N_b × S_b × (3 × 33)/(6 × S_b) = 16.5 N_b.

Figure 3. The data distribution graph of LaRS, where the number of blocks retrieved from the i-th (i ∈ {1, 2, ..., 8}) surviving node is {6, 6, 6, 6, 5, 3, 3, 1}, respectively.

2) Traditional Recovery Scheme (TRS): In TRS, surviving blocks are read from k random surviving nodes. Assume that TRS accomplishes the reconstruction using the surviving blocks in nodes {SN_1, SN_2, SN_3, SN_4, SN_5, SN_8}. Similarly, with Eq. (1), the total recovery time T_R,TRS is equal to N_b × S_b × (6 × 47)/(6 × S_b) = 47 N_b.

Figure 4. The data distribution graph of TRS, where the number of blocks retrieved from the i-th (i ∈ {1, 2, ..., 8}) surviving node is {6, 6, 6, 6, 6, 0, 0, 6}, respectively.

3) Fastest Recovery Scheme (FastestRS): In FastestRS, the fastest k surviving nodes are selected to provide surviving blocks. In this case, the fastest 6 nodes are {SN_1, SN_2, ..., SN_6}. So the total recovery time T_R,FastestRS is N_b × S_b × (6 × 33)/(6 × S_b) = 33 N_b.

Figure 5. The data distribution graph of FastestRS, where the number of blocks retrieved from the i-th (i ∈ {1, 2, ..., 8}) surviving node is {6, 6, 6, 6, 6, 6, 0, 0}, respectively.

As analyzed above, the recovery time of LaRS is the lowest; it is 66% and 50% lower than that of TRS and FastestRS, respectively.

IV. PERFORMANCE EVALUATION

In this section, we comparatively evaluate the different recovery schemes on a heterogeneous networked storage cluster.
We evaluate three recovery schemes in the case of a single node failure: (1) the TRS scheme, which fetches data blocks from k randomly-chosen surviving storage nodes; (2) the FastestRS scheme, which fetches data blocks from the fastest k surviving storage nodes; and (3) our LaRS scheme, which fetches data blocks according to the performance of each surviving storage node.
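As a rough illustration of how the three schemes differ in choosing where reconstruction reads go (our own Python sketch; node indices, times, and weights are hypothetical):

    import random

    def trs_sources(nodes, k):
        """TRS: any k surviving nodes, chosen at random."""
        return random.sample(nodes, k)

    def fastest_sources(nodes, k, t_r):
        """FastestRS: the k surviving nodes with the smallest per-block retrieval time."""
        return sorted(nodes, key=lambda n: t_r[n])[:k]

    def lars_read_fractions(nodes, weights):
        """LaRS: every surviving node serves reads, in proportion to its weight
        (Algorithm 1's bitmap realizes these proportions stripe by stripe)."""
        total = sum(weights[n] for n in nodes)
        return {n: weights[n] / total for n in nodes}

    nodes = list(range(8))                                   # 8 surviving nodes
    t_r = dict(enumerate([16, 16, 16, 16, 20, 33, 33, 47]))  # ms per block (hypothetical)
    weights = dict(enumerate([6, 6, 6, 6, 5, 3, 3, 1]))      # hypothetical weights
    print(trs_sources(nodes, 6))
    print(fastest_sources(nodes, 6, t_r))
    print(lars_read_fractions(nodes, weights))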

A. Experimental Setup

Our experiments are conducted on a commodity erasure-coded cluster consisting of 9 storage nodes. All the nodes are connected through a Cisco switch. Each storage node contains an Intel(R) 3.2GHz CPU, 4GB DDR3 main memory, and an Intel G41 chipset mainboard with a 1Gbps Ethernet NIC. All disks in the storage nodes are Western Digital enterprise-class WD1002FBYS SATA2 drives. The operating system of the storage nodes is Ubuntu 10.04 x86_64. The testbed is configured to have a heterogeneous setting: we use the Traffic Control (TC) tool [20] to configure the Ethernet NICs and the hdparm tool [21] to configure the disks of different storage nodes with various parameters.

B. Methodology

Evidence shows that a configuration of r=3 achieves a sufficiently large MTTDL for archival storage systems. Therefore, we reserve three storage nodes as parity nodes in our tests; in particular, coding parameters k=6 and r=3 are adopted. To examine the impact of block size on the recovery performance, we deploy different block sizes {16KB, 32KB, 64KB, 128KB, and 256KB} for all three recovery schemes.

We use the TC and hdparm tools to adjust the storage nodes with various parameters, thus giving the storage nodes different response times. As a result, two heterogeneous scenarios that simulate real-world application environments are obtained: (1) Scenario A simulates an environment where half of the surviving nodes in the cluster are very fast while the remaining ones are slow; (2) Scenario B imitates an environment where there are many different types of hardware. Fig. 6 shows the average completion time of downloading a block from each of the 8 surviving nodes.

Table III. The storage nodes' weights in the two scenarios (weights of DN_1 through DN_5 and PN_1 through PN_3 under Scenario A and Scenario B).

In the experiments, we assume that the volume of each node is 1 GByte. We use the reconstruction time as the performance metric of the reconstruction schemes, and present the average reconstruction time over five runs for each test. We implement the three recovery schemes (i.e., TRS, FastestRS, and LaRS) in an application-level recovery program running on the rebuilding node. The decoding operation employs the Jerasure library [22], which supports fast RS coding based on bit matrices.

C. Experimental Results and Analysis

Figure 6. The characteristics of the two application scenarios: average completion time (ms) of downloading a block from nodes DN1-DN5 and PN1-PN3 under (a) Scenario A and (b) Scenario B.

After intensive experiments, we find that the LaRS scheme performs well when the parameters L_upper and L_lower are in the ranges [1.5, 1.7] and [0.4, 0.6], respectively. In addition, an updating interval of 1 second is accurate enough to reflect the variance of the I/O loads in the LAN-based storage cluster. Therefore, we set the parameters L_upper, L_lower, and T_U to 1.6, 0.5, and 1 second, respectively, in the subsequent tests.

We first evaluate the recovery performance of the different recovery strategies in both scenarios. As shown in Fig. 7, both FastestRS and LaRS outperform TRS, because FastestRS and LaRS take the heterogeneity of the storage cluster into account to improve recovery performance, while TRS reads surviving blocks from k randomly-chosen surviving nodes. Furthermore, LaRS outperforms FastestRS in both scenarios under different block sizes.
The reason lies in the fact that the FastestRS scheme simply reads data from the fastest k nodes, whereas LaRS further improves the read performance by exploiting the heterogeneity among all the storage nodes. From Fig. 7(a), we see that LaRS achieves the highest reconstruction performance. Specifically, LaRS speeds up the recovery of TRS and FastestRS by factors of up to 1.58 and 1.52 under Scenario A when the block size is 256KB, respectively, because LaRS makes good use of all surviving nodes and assigns tasks according to the capacity of each node. On the other hand, the reconstruction performance of TRS is slightly better than that of FastestRS; this is because half of the surviving nodes are very slow in Scenario A, so that both TRS and FastestRS choose some slow surviving nodes during reconstruction.

Under Scenario B, the cluster contains different types of

surviving nodes; that is, all storage nodes have different transmission speeds and I/O capacities. As shown in Fig. 7(b), compared to TRS, the total recovery times of FastestRS and LaRS decrease by up to 14% and 42%, respectively. There is a significant performance distinction between TRS and FastestRS, because each storage node has different performance and FastestRS can choose the fastest k nodes while TRS may choose some slow nodes. Meanwhile, LaRS obtains a shorter recovery time than in Scenario A because LaRS can make the workload on the surviving nodes more balanced under Scenario B.

From Figs. 7(a) and 7(b), we observe that LaRS always has the best reconstruction performance under both scenarios. FastestRS performs better than TRS under Scenario B, while FastestRS and TRS have close performance under Scenario A.

Figure 7. Recovery time (in sec) of TRS, FastestRS, and LaRS under different block sizes (16KB, 32KB, 64KB, 128KB, and 256KB): (a) Scenario A, (b) Scenario B.

It is also observed that the recovery time decreases with increasing block size. This trend lies in the fact that the number of disk I/Os decreases as the block size grows, thereby reducing the overall transmission latency, which dominates the recovery performance.

1) Impact of Parameter r on Reconstruction Performance: In this group of tests, we evaluate the recovery performance of LaRS after adding a new parity node to the storage cluster. For convenience, we add a new parity node to the storage cluster of Scenario A. Five different block sizes {16KB, 32KB, 64KB, 128KB, and 256KB} are adopted, and the weight of the new node is set to 16 and 3, respectively. Fig. 8 shows the recovery time of LaRS in the different situations. The recovery time under r=4 is smaller than that under r=3 because the new node adds bandwidth resources. Similarly, in the case of r=4, the recovery time under weight=16 is smaller than that under weight=3; the reason is that the newly added node with weight=16 has a higher transmission bandwidth, so the rebuilding node can achieve higher surviving-block-reading throughput.

Figure 8. Comparison of the reconstruction time (in sec) of LaRS under different redundancy parameters (r=3, r=4 with weight=3, and r=4 with weight=16).

V. FURTHER DISCUSSION

Although the parameters k=6 and r=3 are deployed in the evaluation, the coding parameters can be adjusted according to a specific I/O scenario. WAS adopts a configuration of k=12 and r=4 in its early storage cluster [12], and our LaRS scheme still takes effect for the configuration of k=12 and r=4.

In the case of a single failure, one rebuilding node fetches surviving blocks from all surviving nodes according to the weight of each surviving node. Under multiple node failures, our LaRS still works. In particular, when there exist f (2 ≤ f ≤ r) failed nodes, one rebuilding node retrieves surviving blocks from the k+r-f surviving nodes with the help of a determined bitmap BM (see Algorithm 1). Certainly, compared to the single-node-failure case, LaRS is expected to have degraded performance under f (f ≥ 2) node failures since there are fewer surviving nodes.
VI. CONCLUSION AND FUTURE WORK

How to accomplish efficient recovery from node failures in heterogeneous erasure-coded storage systems is an important research topic. Aiming to speed up recovery, we present a Load-aware Recovery Scheme (LaRS) that maximizes the utilization of all surviving nodes by exploiting their heterogeneity. We evaluate LaRS and two alternative schemes

under a real-world heterogeneous erasure-coded storage cluster. The comparative experiments justify the effectiveness of our LaRS scheme in achieving efficient node recovery in a heterogeneous storage environment. In particular, the experimental results indicate that our LaRS scheme outperforms the other two schemes by factors of up to 1.58 and 1.52 in a 9-node heterogeneous RS-coded storage cluster.

We have considered the single-node-failure case for the recovery schemes in this paper. Analytically, our LaRS scheme still takes effect in the case of double or more failures, and we plan to evaluate the performance of LaRS for double- and more-node recovery in future work.

ACKNOWLEDGMENT

This work is supported in part by the National High Technology Research Program of China under Grant No. 213AA1323 and the National Basic Research Program of China under Grant No. 211CB3233. This work is also supported by the Fundamental Research Funds for the Central Universities, HUST, under No. 214QN12.

REFERENCES

[1] D. Ford, F. Labelle, F. Popovici, M. Stokely, V. Truong, L. Barroso, C. Grimes, and S. Quinlan, "Availability in globally distributed storage systems," in Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2010.
[2] B. Calder, J. Wang, A. Ogus, N. Nilakantan, A. Skjolsvold, S. McKelvie, Y. Xu, S. Srivastav, J. Wu, H. Simitci et al., "Windows Azure Storage: a highly available cloud storage service with strong consistency," in Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP). ACM, 2011.
[3] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in ACM SIGOPS Operating Systems Review, vol. 37, no. 5. ACM, 2003.
[4] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's highly available key-value store," in SOSP, 2007.
[5] R. L. Collins and J. S. Plank, "Downloading replicated, wide-area files - a framework and empirical evaluation," in Proceedings of the Third IEEE International Symposium on Network Computing and Applications (NCA 2004). IEEE, 2004.
[6] R. Bhagwan, K. Tati, Y.-C. Cheng, S. Savage, and G. M. Voelker, "Total Recall: system support for automated availability management," in NSDI, 2004.
[7] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer et al., "OceanStore: an architecture for global-scale persistent storage," ACM SIGPLAN Notices, vol. 35, no. 11, 2000.
[8] Z. Wang, A. G. Dimakis, and J. Bruck, "Rebuilding for array codes in distributed storage systems," in GLOBECOM Workshops (GC Wkshps). IEEE, 2010.
[9] L. Xiang, Y. Xu, J. Lui, Q. Chang, Y. Pan, and R. Li, "A hybrid approach to failed disk recovery using RAID-6 codes: algorithms and performance evaluation," ACM Transactions on Storage (TOS), vol. 7, no. 3, 2011.
[10] J. Huang, X. Liang, X. Qin, Q. Cao, and C. Xie, "PUSH: a pipelined reconstruction I/O for erasure-coded storage clusters," IEEE Transactions on Parallel and Distributed Systems, 2014.
[11] I. S. Reed and G. Solomon, "Polynomial codes over certain finite fields," Journal of the Society for Industrial & Applied Mathematics, vol. 8, no. 2, 1960.
[12] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin, "Erasure coding in Windows Azure Storage," in Proceedings of the 2012 USENIX Annual Technical Conference (ATC '12). Boston, MA, USA: USENIX, 2012.
[13] P. Corbett, B. English, A.
Goel, T. Grcanac, S. Kleiman, J. Leong, and S. Sankar, "Row-diagonal parity for double disk failure correction," in Proceedings of the 3rd USENIX Conference on File and Storage Technologies (FAST), 2004.
[14] M. Blaum, J. Brady, J. Bruck, and J. Menon, "EVENODD: an efficient scheme for tolerating double disk failures in RAID architectures," IEEE Transactions on Computers, vol. 44, no. 2, 1995.
[15] J. Huang, X. Qin, F. Zhang, W.-S. Ku, and C. Xie, "MFTS: a multi-level fault-tolerant archiving storage with optimized maintenance bandwidth," IEEE Transactions on Dependable and Secure Computing, 2014.
[16] C. Wu, X. He, G. Wu, S. Wan, X. Liu, Q. Cao, and C. Xie, "HDP code: a horizontal-diagonal parity code to optimize I/O load balancing in RAID-6," in Proceedings of the 41st IEEE/IFIP International Conference on Dependable Systems & Networks (DSN). IEEE, 2011.
[17] L. Tian, D. Feng, H. Jiang, K. Zhou, L. Zeng, J. Chen, Z. Wang, and Z. Song, "PRO: a popularity-based multi-threaded reconstruction optimization for RAID-structured storage systems," in FAST, 2007.
[18] Y. Zhu, P. P. Lee, L. Xiang, Y. Xu, and L. Gao, "A cost-based heterogeneous recovery scheme for distributed storage systems with RAID-6 codes," in Proceedings of the 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 2012.
[19] Y. Zhu, P. P. Lee, Y. Hu, L. Xiang, and Y. Xu, "On the speedup of single-disk failure recovery in XOR-coded storage systems: theory and practice," in Proceedings of the 28th IEEE Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 2012.
[20] W. Almesberger et al., "Linux network traffic control: implementation overview."
[21] hdparm utility - get/set ATA/SATA drive parameters under Linux, open source code distribution.
[22] J. S. Plank, S. Simmerman, and C. D. Schuman, "Jerasure: a library in C/C++ facilitating erasure coding for storage applications - version 1.2," University of Tennessee, Tech. Rep. CS-08-627, 2008.


More information

COMS: Customer Oriented Migration Service

COMS: Customer Oriented Migration Service Boise State University ScholarWorks Computer Science Faculty Publications and Presentations Department of Computer Science 1-1-217 COMS: Customer Oriented Migration Service Kai Huang Boise State University

More information

Analyzing and Improving Load Balancing Algorithm of MooseFS

Analyzing and Improving Load Balancing Algorithm of MooseFS , pp. 169-176 http://dx.doi.org/10.14257/ijgdc.2014.7.4.16 Analyzing and Improving Load Balancing Algorithm of MooseFS Zhang Baojun 1, Pan Ruifang 1 and Ye Fujun 2 1. New Media Institute, Zhejiang University

More information

Parity Logging with Reserved Space: Towards Efficient Updates and Recovery in Erasure-coded Clustered Storage

Parity Logging with Reserved Space: Towards Efficient Updates and Recovery in Erasure-coded Clustered Storage Parity Logging with Reserved Space: Towards Efficient Updates and Recovery in Erasure-coded Clustered Storage Jeremy C. W. Chan, Qian Ding, Patrick P. C. Lee, and Helen H. W. Chan, The Chinese University

More information

A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop

A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop Myoungjin Kim 1, Seungho Han 1, Jongjin Jung 3, Hanku Lee 1,2,*, Okkyung Choi 2 1 Department of Internet and Multimedia Engineering,

More information

Research on Implement Snapshot of pnfs Distributed File System

Research on Implement Snapshot of pnfs Distributed File System Applied Mathematics & Information Sciences An International Journal 2011 NSP 5 (2) (2011), 179S-185S Research on Implement Snapshot of pnfs Distributed File System Liu-Chao, Zhang-Jing Wang, Liu Zhenjun,

More information

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,

More information

Correlation-aware Prefetching in Fault-tolerant Distributed Object-based File System

Correlation-aware Prefetching in Fault-tolerant Distributed Object-based File System Journal of Computational Information Systems 8: 16 (212) 1 8 Available at http://www.jofcis.com Correlation-aware Prefetching in Fault-tolerant Distributed Object-based File System Jiancong TONG, Bin ZHANG,

More information

Correlation based File Prefetching Approach for Hadoop

Correlation based File Prefetching Approach for Hadoop IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie

More information

I/O CANNOT BE IGNORED

I/O CANNOT BE IGNORED LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.

More information

RAIDX: RAID without striping

RAIDX: RAID without striping RAIDX: RAID without striping András Fekete University of New Hampshire afekete@wildcats.unh.edu Elizabeth Varki University of New Hampshire varki@cs.unh.edu Abstract Each disk of traditional RAID is logically

More information

Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce

Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce Shiori KURAZUMI, Tomoaki TSUMURA, Shoichi SAITO and Hiroshi MATSUO Nagoya Institute of Technology Gokiso, Showa, Nagoya, Aichi,

More information

Research on Design and Application of Computer Database Quality Evaluation Model

Research on Design and Application of Computer Database Quality Evaluation Model Research on Design and Application of Computer Database Quality Evaluation Model Abstract Hong Li, Hui Ge Shihezi Radio and TV University, Shihezi 832000, China Computer data quality evaluation is the

More information

Erasure Codes for Heterogeneous Networked Storage Systems

Erasure Codes for Heterogeneous Networked Storage Systems Erasure Codes for Heterogeneous Networked Storage Systems Lluís Pàmies i Juárez Lluís Pàmies i Juárez lpjuarez@ntu.edu.sg . Introduction Outline 2. Distributed Storage Allocation Problem 3. Homogeneous

More information

The material in this lecture is taken from Dynamo: Amazon s Highly Available Key-value Store, by G. DeCandia, D. Hastorun, M. Jampani, G.

The material in this lecture is taken from Dynamo: Amazon s Highly Available Key-value Store, by G. DeCandia, D. Hastorun, M. Jampani, G. The material in this lecture is taken from Dynamo: Amazon s Highly Available Key-value Store, by G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall,

More information

Network Design Considerations for Grid Computing

Network Design Considerations for Grid Computing Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom

More information

GearDB: A GC-free Key-Value Store on HM-SMR Drives with Gear Compaction

GearDB: A GC-free Key-Value Store on HM-SMR Drives with Gear Compaction GearDB: A GC-free Key-Value Store on HM-SMR Drives with Gear Compaction Ting Yao 1,2, Jiguang Wan 1, Ping Huang 2, Yiwen Zhang 1, Zhiwen Liu 1 Changsheng Xie 1, and Xubin He 2 1 Huazhong University of

More information

SYSTEM UPGRADE, INC Making Good Computers Better. System Upgrade Teaches RAID

SYSTEM UPGRADE, INC Making Good Computers Better. System Upgrade Teaches RAID System Upgrade Teaches RAID In the growing computer industry we often find it difficult to keep track of the everyday changes in technology. At System Upgrade, Inc it is our goal and mission to provide

More information

A Remote Hot Standby System of Oracle

A Remote Hot Standby System of Oracle 2012 International Conference on Image, Vision and Computing (ICIVC 2012) IPCSIT vol. 50 (2012) (2012) IACSIT Press, Singapore DOI: 10.7763/IPCSIT.2012.V50.51 A Remote Hot Standby System of Oracle Qiu

More information

A survey on regenerating codes

A survey on regenerating codes International Journal of Scientific and Research Publications, Volume 4, Issue 11, November 2014 1 A survey on regenerating codes V. Anto Vins *, S.Umamageswari **, P.Saranya ** * P.G Scholar, Department

More information

An Efficient Storage Mechanism to Distribute Disk Load in a VoD Server

An Efficient Storage Mechanism to Distribute Disk Load in a VoD Server An Efficient Storage Mechanism to Distribute Disk Load in a VoD Server D.N. Sujatha 1, K. Girish 1, K.R. Venugopal 1,andL.M.Patnaik 2 1 Department of Computer Science and Engineering University Visvesvaraya

More information

A Hybrid Scheme for Object Allocation in a Distributed Object-Storage System

A Hybrid Scheme for Object Allocation in a Distributed Object-Storage System A Hybrid Scheme for Object Allocation in a Distributed Object-Storage System Fang Wang **, Shunda Zhang, Dan Feng, Hong Jiang, Lingfang Zeng, and Song Lv Key Laboratory of Data Storage System, Ministry

More information

Remote Direct Storage Management for Exa-Scale Storage

Remote Direct Storage Management for Exa-Scale Storage , pp.15-20 http://dx.doi.org/10.14257/astl.2016.139.04 Remote Direct Storage Management for Exa-Scale Storage Dong-Oh Kim, Myung-Hoon Cha, Hong-Yeon Kim Storage System Research Team, High Performance Computing

More information

All About Erasure Codes: - Reed-Solomon Coding - LDPC Coding. James S. Plank. ICL - August 20, 2004

All About Erasure Codes: - Reed-Solomon Coding - LDPC Coding. James S. Plank. ICL - August 20, 2004 All About Erasure Codes: - Reed-Solomon Coding - LDPC Coding James S. Plank Logistical Computing and Internetworking Laboratory Department of Computer Science University of Tennessee ICL - August 2, 24

More information

An Architectural Approach to Improving the Availability of Parity-Based RAID Systems

An Architectural Approach to Improving the Availability of Parity-Based RAID Systems Computer Science and Engineering, Department of CSE Technical reports University of Nebraska - Lincoln Year 2007 An Architectural Approach to Improving the Availability of Parity-Based RAID Systems Lei

More information

Shaking Service Requests in Peer-to-Peer Video Systems

Shaking Service Requests in Peer-to-Peer Video Systems Service in Peer-to-Peer Video Systems Ying Cai Ashwin Natarajan Johnny Wong Department of Computer Science Iowa State University Ames, IA 500, U. S. A. E-mail: {yingcai, ashwin, wong@cs.iastate.edu Abstract

More information

Evaluating Auto Scalable Application on Cloud

Evaluating Auto Scalable Application on Cloud Evaluating Auto Scalable Application on Cloud Takashi Okamoto Abstract Cloud computing enables dynamic scaling out of system resources, depending on workloads and data volume. In addition to the conventional

More information

Decentralized Distributed Storage System for Big Data

Decentralized Distributed Storage System for Big Data Decentralized Distributed Storage System for Big Presenter: Wei Xie -Intensive Scalable Computing Laboratory(DISCL) Computer Science Department Texas Tech University Outline Trends in Big and Cloud Storage

More information

Statistical Performance Comparisons of Computers

Statistical Performance Comparisons of Computers Tianshi Chen 1, Yunji Chen 1, Qi Guo 1, Olivier Temam 2, Yue Wu 1, Weiwu Hu 1 1 State Key Laboratory of Computer Architecture, Institute of Computing Technology (ICT), Chinese Academy of Sciences, Beijing,

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google SOSP 03, October 19 22, 2003, New York, USA Hyeon-Gyu Lee, and Yeong-Jae Woo Memory & Storage Architecture Lab. School

More information

Google File System (GFS) and Hadoop Distributed File System (HDFS)

Google File System (GFS) and Hadoop Distributed File System (HDFS) Google File System (GFS) and Hadoop Distributed File System (HDFS) 1 Hadoop: Architectural Design Principles Linear scalability More nodes can do more work within the same time Linear on data size, linear

More information

BCStore: Bandwidth-Efficient In-memory KV-Store with Batch Coding. Shenglong Li, Quanlu Zhang, Zhi Yang and Yafei Dai Peking University

BCStore: Bandwidth-Efficient In-memory KV-Store with Batch Coding. Shenglong Li, Quanlu Zhang, Zhi Yang and Yafei Dai Peking University BCStore: Bandwidth-Efficient In-memory KV-Store with Batch Coding Shenglong Li, Quanlu Zhang, Zhi Yang and Yafei Dai Peking University Outline Introduction and Motivation Our Design System and Implementation

More information

Multi-path based Algorithms for Data Transfer in the Grid Environment

Multi-path based Algorithms for Data Transfer in the Grid Environment New Generation Computing, 28(2010)129-136 Ohmsha, Ltd. and Springer Multi-path based Algorithms for Data Transfer in the Grid Environment Muzhou XIONG 1,2, Dan CHEN 2,3, Hai JIN 1 and Song WU 1 1 School

More information

Today s Papers. Array Reliability. RAID Basics (Two optional papers) EECS 262a Advanced Topics in Computer Systems Lecture 3

Today s Papers. Array Reliability. RAID Basics (Two optional papers) EECS 262a Advanced Topics in Computer Systems Lecture 3 EECS 262a Advanced Topics in Computer Systems Lecture 3 Filesystems (Con t) September 10 th, 2012 John Kubiatowicz and Anthony D. Joseph Electrical Engineering and Computer Sciences University of California,

More information

AOS: Adaptive Out-of-order Scheduling for Write-caused Interference Reduction in Solid State Disks

AOS: Adaptive Out-of-order Scheduling for Write-caused Interference Reduction in Solid State Disks , March 18-20, 2015, Hong Kong AOS: Adaptive Out-of-order Scheduling for Write-caused Interference Reduction in Solid State Disks Pingguo Li, Fei Wu*, You Zhou, Changsheng Xie, Jiang Yu Abstract The read/write

More information

Software-defined Storage: Fast, Safe and Efficient

Software-defined Storage: Fast, Safe and Efficient Software-defined Storage: Fast, Safe and Efficient TRY NOW Thanks to Blockchain and Intel Intelligent Storage Acceleration Library Every piece of data is required to be stored somewhere. We all know about

More information

Low Complexity Opportunistic Decoder for Network Coding

Low Complexity Opportunistic Decoder for Network Coding Low Complexity Opportunistic Decoder for Network Coding Bei Yin, Michael Wu, Guohui Wang, and Joseph R. Cavallaro ECE Department, Rice University, 6100 Main St., Houston, TX 77005 Email: {by2, mbw2, wgh,

More information

Australian Journal of Basic and Applied Sciences

Australian Journal of Basic and Applied Sciences ISSN:1991-8178 Australian Journal of Basic and Applied Sciences Journal home page: www.ajbasweb.com A Review on Raid Levels Implementation and Comparisons P. Sivakumar and K. Devi Department of Computer

More information

A Joint Replication-Migration-based Routing in Delay Tolerant Networks

A Joint Replication-Migration-based Routing in Delay Tolerant Networks A Joint -Migration-based Routing in Delay Tolerant Networks Yunsheng Wang and Jie Wu Dept. of Computer and Info. Sciences Temple University Philadelphia, PA 19122 Zhen Jiang Dept. of Computer Science West

More information

Research on Availability of Virtual Machine Hot Standby based on Double Shadow Page Tables

Research on Availability of Virtual Machine Hot Standby based on Double Shadow Page Tables International Conference on Computer, Networks and Communication Engineering (ICCNCE 2013) Research on Availability of Virtual Machine Hot Standby based on Double Shadow Page Tables Zhiyun Zheng, Huiling

More information