LaRS: A Load-aware Recovery Scheme for Heterogeneous Erasure-Coded Storage Clusters


2014 9th IEEE International Conference on Networking, Architecture, and Storage

LaRS: A Load-aware Recovery Scheme for Heterogeneous Erasure-Coded Storage Clusters

Haibing Luo, Jianzhong Huang, Qiang Cao and Changsheng Xie
Wuhan National Lab. for Optoelectronics, Huazhong University of Science and Technology, Wuhan, China
{luohaibinghust@gmail.com, Corresponding Author: hjzh@hust.edu.cn}

Abstract: To reduce the probability of data unavailability, it is extremely important to quickly recover failed data in a (k+r,k) erasure-coded storage cluster. In practice, storage nodes in a large-scale storage system have various network bandwidths and I/O capabilities; therefore, the heterogeneity of a storage system increases along with its growing scale. Both the traditional recovery scheme and the Fastest recovery scheme simply retrieve k surviving blocks from k surviving nodes, thereby resulting in low recovery performance in a heterogeneous storage cluster. In this paper, we propose a Load-aware Recovery Scheme (LaRS) for heterogeneous RS-coded storage clusters. LaRS not only takes into account both the heterogeneity and the load of nodes, but also enables all surviving nodes to service reconstruction reads. The number of surviving blocks retrieved from a surviving node depends on its load weight, which is determined by both network bandwidth and I/O capacity: more blocks are fetched from faster nodes, and vice versa. The three recovery schemes are implemented on a 9-node heterogeneous RS-coded storage cluster, where a set of comparative experiments is conducted. The experimental results show that our LaRS scheme outperforms the other two schemes by a factor of up to 1.58.

Keywords: Heterogeneous storage; erasure codes; recovery scheme

I. INTRODUCTION

Storage clusters provide a scalable platform for storing massive data over multiple storage nodes; for example, they have been deployed in cloud storage (e.g., GFS II [1] and Microsoft Azure Storage (WAS) [2]). Since node failures are common, data redundancy mechanisms (e.g., replication [3][4][5] or erasure coding [6][7][2]) are used to ensure data availability. Because of their higher fault tolerance at the same storage overhead, (k+r,k) Reed-Solomon (RS) codes are used by real-world storage clusters (e.g., GFS II and WAS). To maintain overall system reliability, it is important to rapidly recover a failed node, i.e., to reconstruct the lost data of the failed node by retrieving the required blocks from surviving nodes.

Over the last decade, the scale of storage systems has grown rapidly due to the data explosion. For example, the number of Microsoft's servers had reached one million by 2013 according to Microsoft CEO Steve Ballmer, and one of the data centers built by Google has more than 45 thousand servers. Thus, it is very difficult to maintain the homogeneity of servers at such a large scale. In particular, these servers may come from different vendors, and storage nodes from the same vendor may have different capabilities after system upgrades. Heterogeneity of storage nodes has an adverse effect on node recovery within the traditional recovery solution (TRS), where only k surviving nodes are involved in the node recovery. Although some recovery solutions have been proposed for erasure-coded storage systems [8][9][10], they usually assume that all disks or nodes are homogeneous. In practice, both the transmission bandwidth and the I/O capability of nodes are usually different.
The study of efficient recovery from node failures in a heterogeneous RS-coded storage system is still in its infancy. We propose a Load-aware Recovery Scheme (LaRS) for heterogeneous erasure-coded storage systems. The key design idea of LaRS is to reduce the reconstruction time based on the load of all surviving nodes when one or more failures occur in a heterogeneous distributed storage system. Specifically, fewer (or more) blocks are retrieved from the slower (or faster) surviving storage nodes, thus minimizing the negative impact of slow storage nodes on the total recovery time.

Our main contribution is twofold. First, we propose an efficient recovery strategy that retrieves surviving blocks from all surviving nodes, where the load of a surviving node determines the number of surviving blocks retrieved from it. Second, we analyze the LaRS scheme in a quantitative manner and validate its superiority by prototyping it in a heterogeneous erasure-coded storage cluster.

The remainder of this paper is organized as follows. Section II presents the related work. Section III details the concept and design of LaRS. Section IV evaluates LaRS experimentally, and Section V provides further discussion. Finally, we conclude our work in Section VI.

II. RELATED WORK

A. Erasure-Coded Storage Clusters

Besides replication, erasure coding is another redundancy mechanism for storage systems. In a (k+r,k) RS-coded storage cluster (see Fig. 1), the k original data blocks and the r parity blocks generated via the erasure code are exclusively stored on the storage nodes. Any k blocks in a stripe suffice to recover the k original blocks. As a result, data stored on the RS-coded storage cluster are protected against simultaneous failures of up to r nodes. RS codes are Maximum Distance Separable (MDS), which means that even if any r of the k+r nodes fail, the

failed data can be recomputed from the k surviving disks [11]. We investigate our recovery scheme for clustered storage based on RS codes, which are also used in many in-production storage systems, including GFS II [1] and Microsoft Azure Storage [12].

Figure 1. Architecture and data layout of a (k+r, k) RS-coded storage cluster (client, switch, manager, k data nodes DN_1..DN_k storing data blocks, and r parity nodes PN_1..PN_r storing parity blocks).

Commonly, (k+r, k) RS coding means that there are k+r encoded blocks in a stripe, including k information blocks (namely data blocks) and r redundancy blocks (namely parity blocks). RS codes use a generator matrix and Galois Field arithmetic in both encoding and decoding. The generator matrix consists of a k×k identity matrix and a k×r redundancy matrix. A potentially limitless sequence of encoded blocks can be generated from the data blocks. Due to Galois Field multiplications, RS codes are much more expensive in computation than parity array codes (e.g., RDP [13], EVENODD [14], etc.), which only involve XOR calculations. As mentioned in [15], it is not CPU computation but network latency and storage bandwidth that play the dominant role in access performance for networked RS-coded storage systems.

B. Recovery Schemes

Most existing recovery schemes are designed for RAID systems. They fall mainly into two categories. (1) Schemes that assume the disks are homogeneous. For example, Xiang et al. [9] and Wang et al. [8] present optimal single-node failure recovery solutions for RDP and EVENODD; both schemes utilize the redundancy property of XOR-based erasure codes to optimize the recovery, without considering the heterogeneity and load of disks. Although Wu et al. [16] design a new RAID-6 code that evenly distributes read/write requests among surviving disks, the key premise is that all surviving disks have the same I/O capability. (2) Schemes that take the load of disks into account. For instance, Tian et al. [17] design a popularity-based reconstruction optimization scheme (PRO) for RAID-structured storage systems. The PRO scheme reorders recovery I/Os by first rebuilding frequently read data, thus minimizing the amount of read requests incurred by the rebuilding process. Zhu et al. propose a cost-based heterogeneous recovery (CHR) scheme [18] and a replace recovery algorithm [19] for distributed storage systems based on RAID-6 codes. Both recovery solutions mainly aim at minimizing the total recovery cost of a single-node failure by exploiting the property that there are multiple recovery chains for a single-node recovery in double-fault-tolerant RAID-6 codes. Different from the CHR scheme, which addresses node heterogeneity by periodically choosing a suitable recovery path, our LaRS scheme maximizes the utilization of all surviving nodes by retrieving more surviving blocks from the faster surviving nodes.

In this work, we present a load-aware recovery scheme that schedules the recovery read requests according to each surviving node's weight, which varies with both the heterogeneity and the load of that node. To our knowledge, this is the first work that explicitly seeks to optimize failure recovery for heterogeneous distributed storage systems with Reed-Solomon codes.

III. DESIGN OF LARS

In this section, we present the main idea and design of LaRS, and discuss why it is suitable for heterogeneous erasure-coded storage clusters.
To simplify our discussion, we focus on the single-node failure recovery problem, but LaRS is also applicable to the multi-node failure recovery problem. We consider an (n=k+r, k) RS-coded storage cluster. As shown in Fig. 1, the storage system has n storage nodes and a manager, which are connected by a local area network (LAN). Encoded data is striped across the storage nodes to enable high I/O throughput. In a stripe, the k data blocks and the r parity blocks are exclusively stored on the k data nodes (DNs) and the r parity nodes (PNs).

Now let data node DN_i be the failed node, where 1 ≤ i ≤ k. Our goal is to recover all lost data of DN_i. According to RS codes, to recover a data block on a storage node, we should read k blocks that belong to the same stripe as the lost block from any k surviving nodes. However, some problems arise in a heterogeneous environment. First, surviving nodes with low transmission bandwidth and I/O capacity may delay data transmission, thus prolonging the total reconstruction time. Second, it wastes resources, since some surviving nodes are not involved in the reconstruction process.

The basic idea of LaRS is to retrieve the required surviving blocks from all surviving nodes, and to retrieve fewer (or more) blocks from the slower (or faster) surviving nodes to reconstruct the failed data. Unlike the traditional reconstruction scheme, LaRS retrieves surviving blocks from multiple stripes and recovers multiple failed blocks of the failed node at a time. When a storage node fails, the manager first obtains the

I/O response time of each surviving node, which is used to calculate the weight of each surviving node; it then determines the number and location of the surviving blocks provided by each surviving node according to its weight. The rebuilding node retrieves the associated surviving blocks based on the result given by the manager and then reconstructs the lost data. The above procedure is repeated until all data of the failed node is fully recovered.

As analyzed above, the LaRS scheme has two advantages: (1) it maximizes the utilization of all surviving nodes; (2) it balances the transmission time of the nodes so as to reduce unnecessary waiting time. Although the surviving blocks retrieved from the disk of a surviving node may not be contiguous, since they belong to different stripes, the disk access overhead does not become the performance bottleneck of node reconstruction; the reason is that the surviving-block-receiving step at the rebuilding node dominates the total reconstruction time.

Therefore, the LaRS design should address the following two aspects: (1) an effective algorithm that determines the bitmap BM[N_d][N] according to the weight set W[1..N]; (2) a method to update the weight of each surviving node. Table I lists the symbols used later.

Table I. Symbols and Annotations.
k, r: Number of data blocks and parity blocks in a stripe
N: Number of surviving nodes in the storage cluster
SN_i: The i-th surviving storage node, 1 ≤ i ≤ N
W_i: The weight of node SN_i
W[1..N]: The set of weights; W[1..N] = {W_1, W_2, ..., W_N}
T_A,i: Average response time of SN_i within a sliding window
L_i: The load status of SN_i
L_upper: The upper bound of L_i for updating weight W_i of SN_i
L_lower: The lower bound of L_i for updating weight W_i of SN_i
N_blk,i: Number of blocks fetched from SN_i to the rebuilding node
T_R: The total recovery time
T_U: The weight-updating interval
T_r,i: Time spent to retrieve a block from SN_i
W[1..k]: The set of the k largest values in weight set W[1..N]
BM[N_d][N]: Bitmap of surviving blocks used to reconstruct failed blocks

A. Effectiveness Explanation

The total recovery time T_R is of great importance for a storage system: reducing T_R improves the reliability and availability of the storage system, so our goal is to reduce T_R. The recovery process includes three steps: retrieving surviving blocks from surviving nodes, calculating the failed data blocks, and writing the recovered blocks. Since the three steps can be carried out in a pipelined manner, the total recovery time T_R is restricted by the slowest step. In a heterogeneous cluster, storage nodes have various I/O capacities and transmission bandwidths, and some of them may have very low I/O capacity and transmission bandwidth; by comparison, computing is usually not the performance bottleneck. Given all of this, the step of retrieving surviving blocks is usually the performance bottleneck of the recovery process, and thus we obtain Eq. (1):

T_R = (A_D / a_D) × Max_{i=1..N} {N_blk,i × T_r,i}    (1)

Here A_D is the total amount of data that should be recovered in the failed node, and a_D is the amount of data that the rebuilding node recovers in one recovery round according to bitmap BM[N_d][N]. T_r,i is the time spent to retrieve a block from SN_i, and Max_{i=1..N} {N_blk,i × T_r,i} is the time spent to retrieve all the blocks required to recover the a_D data in one round. Here, we focus on reducing the value of Max_{i=1..N} {N_blk,i × T_r,i} so as to speed up the recovery process.
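To make Eq. (1) concrete, here is a minimal Python sketch of the round-by-round cost model; the per-block retrieval times and the block assignment below are hypothetical illustration values, not the measured numbers from our testbed.

    # Illustrative sketch of Eq. (1): T_R = (A_D / a_D) * max_i { N_blk,i * T_r,i }.
    # All numbers below are hypothetical; they only illustrate the cost model.

    def total_recovery_time(a_total, a_round, n_blk, t_r):
        """a_total: total amount of data to recover (A_D);
        a_round: amount recovered per round (a_D);
        n_blk[i]: blocks fetched from surviving node i in one round (N_blk,i);
        t_r[i]: time to retrieve one block from node i (T_r,i)."""
        # Each round is bounded by the slowest node, i.e. max_i { N_blk,i * T_r,i }.
        per_round = max(n * t for n, t in zip(n_blk, t_r))
        return (a_total / a_round) * per_round

    # Example: each round recovers 6 failed blocks using 8 surviving nodes.
    t_r = [16, 16, 16, 16, 20, 33, 33, 47]   # ms per block (hypothetical)
    n_blk = [6, 6, 6, 6, 5, 3, 3, 1]         # load-aware assignment for one round
    print(total_recovery_time(a_total=600, a_round=6, n_blk=n_blk, t_r=t_r))  # in ms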
Two steps are involved in recovering the a_D data in a recovery round: (1) retrieving the needed data, and (2) calculating the failed blocks. The former step retrieves the required blocks from all surviving nodes. The time spent to download blocks from a surviving node consists of two parts: T_latency and T_transfer. The surviving node spends time T_latency reading local blocks, which are then transferred to the rebuilding node within time T_transfer. The transmission bandwidth of surviving nodes may fluctuate widely in a heterogeneous cluster, so T_transfer dominates the access latency for slow nodes. Meanwhile, a number of surviving nodes may have low-speed I/O systems or outdated disks, and the I/O latency T_latency of these nodes may be several times that of others. Furthermore, the transfer time T_transfer changes quickly along with load fluctuations, and the T_transfer of one surviving node may be an order of magnitude larger than that of another.

Time T_r,i is thus affected by two factors: network transfer and disk access. If the available network bandwidth of the i-th surviving node is low, the rebuilding node will receive its surviving blocks slowly. If the disk of the surviving node suffers heavy I/O requests (i.e., is overloaded), its disk access time will increase. It is apparent that the total time of a recovery round is determined by the slowest storage node. Therefore, if we can reduce the retrieval time contributed by the slower nodes, we can reduce T_R. By virtue of LaRS, we retrieve fewer (or more) blocks from the surviving storage nodes that are slower (or faster). Therefore, all surviving nodes (slow or fast) spend a similar amount of time transmitting their assigned blocks, which reduces the unnecessary waiting time of the rebuilding node; thereby, T_R can be minimized.

B. LaRS

The manager periodically monitors the condition of the storage cluster. When a storage node fails, the manager records the

completion time of retrieving blocks from each surviving node, and calculates the weights of all surviving nodes from the recorded completion times. The weight W_i of the i-th surviving node is calculated according to Eq. (2):

W_i = 1 / T_r,i    (2)

T_r,i is the average completion time of downloading a block from SN_i. The weight represents the data-transmission ability of SN_i: a larger W_i indicates that node SN_i has higher transmission performance. To conveniently compute an integer value for weight W_i, the time T_r,i is scaled by a constant factor.

With the weight set W[1..N] we can calculate the number and location of the blocks that the rebuilding node should read from the surviving nodes. We let BM be an N_d × N two-dimensional matrix that identifies the number and location of the blocks retrieved from surviving nodes, such that the element BM[i][j] (for 0 ≤ i ≤ N_d-1, 0 ≤ j ≤ N-1) is initialized to 0, and is set to 1 if the corresponding block is to be retrieved. We then compute the result from W[1..N] as follows: we first check the number of elements greater than 0 in set W[1..N]; if this number is greater than or equal to k, we select the k largest elements from W[1..N], decrease each of them by 1, and set the corresponding elements of array BM[N_d][N] to 1; otherwise, the iteration terminates. The number of rows of BM is

N_d = (Σ_{i=1}^{N} W_i) / k    (3)

A column of array BM denotes a storage node, and a '1' in the column represents that a block in the corresponding node should be retrieved to the rebuilding node. A row of BM represents a stripe, and all the blocks tagged with '1' in a row are used to reconstruct one failed block. In this way, the rebuilding node can retrieve surviving blocks according to the array BM[N_d][N] to recover the lost data.

C. Weight Updating

From the analysis in the previous section we know that the weights of the storage nodes are very important for the recovery scheme. Considering that the load status of the storage nodes is not constant during the recovery, the weight of each storage node should be dynamic. For instance, if user I/O requests to a storage node are frequent for a period of time, its load increases and the available network bandwidth of this node is reduced. With regard to this, it is desirable to have an algorithm that updates the weights of the storage nodes during the recovery.

T_A,i is the average response time of SN_i within a sliding window and is thus positively correlated with its load status, but by itself it is not enough to judge whether the weight W_i should be updated. So a parameter indicating the relative load status of a surviving node is required. Let T_A,a be the average response time of all the storage nodes within the sliding window. We use the ratio L_i between T_A,i and T_A,a to reflect the load status of the i-th surviving node SN_i:
L_i = T_A,i / T_A,a    (4)

Algorithm 1: Determining the Bitmap of Surviving Blocks
// |W[1..N] > 0| denotes the number of elements greater than 0 in the weight set W[1..N]
Input:
    k: number of data nodes in the storage cluster
    N: number of surviving nodes in the storage cluster
    W[1..N]: the weight set of all surviving storage nodes
Output:
    BM[N_d][N]: bitmap of surviving blocks used for recovery
BM[N_d][N] = {0}, j = 0            // initializing
while |W[1..N] > 0| ≥ k do
    Select the k largest elements from W[1..N] to form W[1..k];
    foreach W_i ∈ W[1..k] do
        W_i--;  BM[j][i] = 1;
    end
    j++;
end
return BM[N_d][N]

Algorithm 1 gives the pseudo-code to generate the array BM[N_d][N] according to the weight set W[1..N].

We set two parameters, L_upper and L_lower: L_upper is the upper bound of the load status and L_lower is the lower bound of the load status. Once the load status L_i of SN_i exceeds L_upper, this storage node is more heavily loaded than the other nodes and we should reduce W_i. If L_i is less than L_lower, this storage node has a lower workload and W_i should be increased; otherwise W_i remains unchanged. In order to keep the recovery process steady, W_i is increased or decreased by 1 at a time. Since the I/O loads change quickly in LAN-based storage clusters, we set the updating interval to 1 second.

Algorithm 2: Updating Weights of Storage Nodes
foreach SN_i ∈ {SN_i} do
    if L_i > L_upper then
        W_i--;
    else if L_i < L_lower then
        W_i++;
    end
end
return W[1..N]
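The following Python sketch is a straightforward transliteration of Algorithms 1 and 2 (our own illustration; variable names and the example weights are assumptions, not the paper's implementation):

    # Sketch of Algorithm 1: build the bitmap BM[N_d][N] from the weight set W[1..N].
    def build_bitmap(weights, k):
        """weights: integer weights W_i of the N surviving nodes; k: blocks needed
        per stripe to reconstruct one failed block. Returns a list of N_d rows."""
        w = list(weights)
        bm = []
        # Schedule stripes while at least k nodes still have weight left.
        while sum(1 for x in w if x > 0) >= k:
            row = [0] * len(w)
            # Pick the k nodes with the largest remaining weights for this stripe.
            for i in sorted(range(len(w)), key=lambda i: w[i], reverse=True)[:k]:
                w[i] -= 1
                row[i] = 1
            bm.append(row)
        return bm

    # Sketch of Algorithm 2: periodic weight update driven by the load ratio L_i.
    def update_weights(weights, load, l_upper=1.6, l_lower=0.5):
        """load[i] = T_A,i / T_A,a; overloaded nodes lose weight, lightly loaded gain."""
        return [w - 1 if l > l_upper else (w + 1 if l < l_lower else w)
                for w, l in zip(weights, load)]

    # Example: 8 surviving nodes, k = 6; sum of weights / k = 6 stripes per bitmap.
    bm = build_bitmap([6, 6, 6, 6, 5, 3, 3, 1], k=6)
    print(len(bm))  # -> 6 (= N_d)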

Algorithm 2 describes the procedure of updating the weights of the storage nodes. This procedure is triggered periodically to adjust the weights W_i, so that the LaRS scheme can maximize the utilization of the surviving nodes. Both parameters L_upper and L_lower are usually associated with the configuration of a specific storage cluster and should be determined through concrete experiments.

D. Example

We demonstrate LaRS using a concrete example, and illustrate how it improves the recovery performance over the traditional and Fastest recovery schemes. Fig. 2 shows a storage system in which 9 storage nodes are connected via a switch; each node is characterized by an average response time. The storage system uses RS codes and has 6 data nodes and 3 parity nodes. Each storage node stores N_b data blocks of size S_b. We assume one data node fails. We calculate W[1..N] according to Eq. (2); the weight of each surviving node is shown in Table II. The total recovery time can then be calculated using Eq. (1).

Figure 2. A 9-node heterogeneous storage cluster (failed node, rebuilding node, and surviving nodes SN_1 to SN_8 with their per-block response times, connected via a switch).

Table II. The storage nodes' weights (response time T_r,i in ms and the corresponding weight W_i for SN_1 through SN_8).

1) LaRS scheme: LaRS first calculates the array BM, which indicates the number and location of surviving blocks. In this example, after the rebuilding node reads 6 blocks from nodes {SN_1, SN_2, SN_3, SN_4}, 5 blocks from node SN_5, 2 blocks from nodes SN_7 and SN_8, and 3 blocks from node SN_6, it can reconstruct 6 failed blocks. This process continues until all blocks of the failed node are recovered. According to Eq. (1), the total recovery time T_R,LaRS is equal to N_b × S_b × (3 × 33)/(6 × S_b) = 16.5 N_b.

Figure 3. The data distribution graph of LaRS, where the number of blocks retrieved from the i-th (i ∈ {1, 2, ..., 8}) surviving node is {6, 6, 6, 6, 5, 3, 3, 1}, respectively.

2) Traditional Recovery Scheme (TRS): In TRS, surviving blocks are read from k random surviving nodes. Assume that TRS accomplishes the reconstruction using the surviving blocks in nodes {SN_1, SN_2, SN_3, SN_4, SN_5, SN_8}. Similarly, with Eq. (1), the total recovery time T_R,TRS is equal to N_b × S_b × (6 × 47)/(6 × S_b) = 47 N_b.

Figure 4. The data distribution graph of TRS, where the number of blocks retrieved from the i-th (i ∈ {1, 2, ..., 8}) surviving node is {6, 6, 6, 6, 6, 0, 0, 6}, respectively.

3) Fastest Recovery Scheme (FastestRS): In FastestRS, the fastest k surviving nodes are selected to provide surviving blocks. In this case, the fastest 6 nodes are {SN_1, SN_2, ..., SN_6}. So the total recovery time T_R,FastestRS is N_b × S_b × (6 × 33)/(6 × S_b) = 33 N_b.

Figure 5. The data distribution graph of FastestRS, where the number of blocks retrieved from the i-th (i ∈ {1, 2, ..., 8}) surviving node is {6, 6, 6, 6, 6, 6, 0, 0}, respectively.

As analyzed above, the recovery time of LaRS is the lowest; it is 66% and 50% lower than that of TRS and FastestRS, respectively.

IV. PERFORMANCE EVALUATION

In this section, we comparatively evaluate the different recovery schemes on a heterogeneous networked storage cluster.
We evaluate three recovery schemes in the case of a single node failure: (1) the TRS scheme, which fetches data blocks from k randomly-chosen surviving storage nodes; (2) the FastestRS scheme, which fetches data blocks from the fastest k surviving storage nodes; and (3) our LaRS scheme, which fetches data blocks according to the performance of each surviving storage node.
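As a rough illustration of how the three schemes differ in choosing where reconstruction reads go (our own Python sketch; node indices, times, and weights are hypothetical):

    import random

    def trs_sources(nodes, k):
        """TRS: any k surviving nodes, chosen at random."""
        return random.sample(nodes, k)

    def fastest_sources(nodes, k, t_r):
        """FastestRS: the k surviving nodes with the smallest per-block retrieval time."""
        return sorted(nodes, key=lambda n: t_r[n])[:k]

    def lars_read_fractions(nodes, weights):
        """LaRS: every surviving node serves reads, in proportion to its weight
        (Algorithm 1's bitmap realizes these proportions stripe by stripe)."""
        total = sum(weights[n] for n in nodes)
        return {n: weights[n] / total for n in nodes}

    nodes = list(range(8))                                   # 8 surviving nodes
    t_r = dict(enumerate([16, 16, 16, 16, 20, 33, 33, 47]))  # ms per block (hypothetical)
    weights = dict(enumerate([6, 6, 6, 6, 5, 3, 3, 1]))      # hypothetical weights
    print(trs_sources(nodes, 6))
    print(fastest_sources(nodes, 6, t_r))
    print(lars_read_fractions(nodes, weights))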

A. Experimental Setup

Our experiments are conducted on a commodity erasure-coded cluster consisting of 9 storage nodes. All the nodes are connected through a Cisco switch. Each storage node contains an Intel(R) 3.2GHz CPU, 4GB DDR3 main memory, and an Intel G41 chipset mainboard with a 1Gbps Ethernet NIC. All disks in the storage nodes are Western Digital enterprise-class WD1002FBYS SATA2 drives. The operating system of the storage nodes is Ubuntu 10.04 x86_64. The testbed is configured to have a heterogeneous setting: we use the Traffic Control (TC) tool [20] to configure the Ethernet NICs and the hdparm tool [21] to configure the disks of different storage nodes with various parameters.

B. Methodology

Evidence shows that a configuration of r=3 achieves a sufficiently large MTTDL for archival storage systems. Therefore, we reserve three storage nodes as parity nodes in our tests; in particular, coding parameters k=6 and r=3 are adopted. To examine the impact of block size on the recovery performance, we deploy different block sizes {16KB, 32KB, 64KB, 128KB, and 256KB} for all three recovery schemes.

We use the TC and hdparm tools to adjust the storage nodes with various parameters, thus giving the storage nodes different response times. As a result, two heterogeneous scenarios that simulate real-world application environments are obtained: (1) Scenario A simulates an environment where half of the surviving nodes in the cluster are very fast while the remaining ones are slow; (2) Scenario B imitates an environment where there are many different types of hardware. Fig. 6 shows the average completion time of downloading a block from each of the 8 surviving nodes.

Table III. The storage nodes' weights in the two scenarios (weights of DN_1 through DN_5 and PN_1 through PN_3 under Scenario A and Scenario B).

In the experiments, we assume that the volume of each node is 1 GByte. We use the reconstruction time as the performance metric of the reconstruction schemes, and present the average reconstruction time over five runs for each test. We implement the three recovery schemes (i.e., TRS, FastestRS, and LaRS) in an application-level recovery program running on the rebuilding node. The decoding operation employs the Jerasure library [22], which supports fast RS coding based on bit matrices.

C. Experimental Results and Analysis

Figure 6. The characteristics of the two application scenarios: average completion time (ms) of downloading a block from nodes DN1-DN5 and PN1-PN3 under (a) Scenario A and (b) Scenario B.

After intensive experiments, we find that the LaRS scheme performs well when the parameters L_upper and L_lower are in the ranges [1.5, 1.7] and [0.4, 0.6], respectively. In addition, an updating interval of 1 second is accurate enough to reflect the variance of the I/O loads in the LAN-based storage cluster. Therefore, we set the parameters L_upper, L_lower, and T_U to 1.6, 0.5, and 1 second, respectively, in the subsequent tests.

We first evaluate the recovery performance of the different recovery strategies in both scenarios. As shown in Fig. 7, both FastestRS and LaRS outperform TRS, because FastestRS and LaRS take the heterogeneity of the storage cluster into account to improve recovery performance, while TRS reads surviving blocks from k randomly-chosen surviving nodes. Furthermore, LaRS outperforms FastestRS in both scenarios under different block sizes.
The reason lies in the fact that the FastestRS scheme simply reads data from the fastest k nodes, whereas LaRS further improves the read performance by exploiting the heterogeneity among all the storage nodes. From Fig. 7(a), we see that LaRS achieves the highest reconstruction performance. Specifically, LaRS speeds up the recovery of TRS and FastestRS by factors of up to 1.58 and 1.52 under Scenario A when the block size is 256KB, respectively, because LaRS makes good use of all surviving nodes and assigns tasks according to the capacity of each node. On the other hand, the reconstruction performance of TRS is slightly better than that of FastestRS; this is because half of the surviving nodes are very slow in Scenario A, so that both TRS and FastestRS choose some slow surviving nodes during reconstruction.

Under Scenario B, the cluster contains different types of

surviving nodes; that is, all storage nodes have different transmission speeds and I/O capacities. As shown in Fig. 7(b), compared to TRS, the total recovery times of FastestRS and LaRS decrease by up to 14% and 42%, respectively. There is a significant performance distinction between TRS and FastestRS, because each storage node has different performance and FastestRS can choose the fastest k nodes while TRS may choose some slow nodes. Meanwhile, LaRS obtains a shorter recovery time than in Scenario A because LaRS can make the workload on the surviving nodes more balanced under Scenario B.

From Figs. 7(a) and 7(b), we observe that LaRS always has the best reconstruction performance under both scenarios. FastestRS performs better than TRS under Scenario B, while FastestRS and TRS have close performance under Scenario A.

Figure 7. Recovery time (in sec) of TRS, FastestRS, and LaRS under different block sizes (16KB, 32KB, 64KB, 128KB, and 256KB): (a) Scenario A, (b) Scenario B.

It is also observed that the recovery time decreases with increasing block size. This trend lies in the fact that the number of disk I/Os decreases as the block size grows, thereby reducing the overall transmission latency, which dominates the recovery performance.

1) Impact of Parameter r on Reconstruction Performance: In this group of tests, we evaluate the recovery performance of LaRS after adding a new parity node to the storage cluster. For convenience, we add a new parity node to the storage cluster of Scenario A. Five different block sizes {16KB, 32KB, 64KB, 128KB, and 256KB} are adopted, and the weight of the new node is set to 16 and 3, respectively. Fig. 8 shows the recovery time of LaRS in the different situations. The recovery time under r=4 is smaller than that under r=3 because the new node adds bandwidth resources. Similarly, in the case of r=4, the recovery time under weight=16 is smaller than that under weight=3; the reason is that the newly added node with weight=16 has a higher transmission bandwidth, so the rebuilding node can achieve higher surviving-block-reading throughput.

Figure 8. Comparison of the reconstruction time (in sec) of LaRS under different redundancy parameters (r=3, r=4 with weight=3, and r=4 with weight=16).

V. FURTHER DISCUSSION

Although the parameters k=6 and r=3 are deployed in the evaluation, the coding parameters can be adjusted according to a specific I/O scenario. WAS adopts a configuration of k=12 and r=4 in its early storage cluster [12], and our LaRS scheme still takes effect for the configuration of k=12 and r=4.

In the case of a single failure, one rebuilding node fetches surviving blocks from all surviving nodes according to the weight of each surviving node. Under multiple node failures, our LaRS still works. In particular, when there exist f (2 ≤ f ≤ r) failed nodes, one rebuilding node retrieves surviving blocks from the k+r-f surviving nodes with the help of a determined bitmap BM (see Algorithm 1). Certainly, compared to the single-node-failure case, LaRS is expected to have degraded performance under f (f ≥ 2) node failures since there are fewer surviving nodes.
VI. CONCLUSION AND FUTURE WORK

How to accomplish efficient recovery from node failures in heterogeneous erasure-coded storage systems is an important research topic. Aiming to speed up recovery, we present a Load-aware Recovery Scheme (LaRS) that maximizes the utilization of all surviving nodes by exploiting their heterogeneity. We evaluate LaRS and two alternative schemes

under a real-world heterogeneous erasure-coded storage cluster. The comparative experiments justify the effectiveness of our LaRS scheme in achieving efficient node recovery in a heterogeneous storage environment. In particular, the experimental results indicate that our LaRS scheme outperforms the other two schemes by factors of up to 1.58 and 1.52 in a 9-node heterogeneous RS-coded storage cluster.

We have considered the single-node-failure case for the recovery schemes in this paper. Analytically, our LaRS scheme still takes effect in the case of double or more failures, and we plan to evaluate the performance of LaRS for double- and more-node recovery in future work.

ACKNOWLEDGMENT

This work is supported in part by the National High Technology Research Program of China under Grant No. 213AA1323 and the National Basic Research Program of China under Grant No. 211CB3233. This work is also supported by the Fundamental Research Funds for the Central Universities, HUST, under No. 214QN12.

REFERENCES

[1] D. Ford, F. Labelle, F. Popovici, M. Stokely, V. Truong, L. Barroso, C. Grimes, and S. Quinlan, "Availability in globally distributed storage systems," in Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2010.
[2] B. Calder, J. Wang, A. Ogus, N. Nilakantan, A. Skjolsvold, S. McKelvie, Y. Xu, S. Srivastav, J. Wu, H. Simitci et al., "Windows Azure Storage: a highly available cloud storage service with strong consistency," in Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP). ACM, 2011.
[3] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in ACM SIGOPS Operating Systems Review, vol. 37, no. 5. ACM, 2003.
[4] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's highly available key-value store," in SOSP, 2007.
[5] R. L. Collins and J. S. Plank, "Downloading replicated, wide-area files - a framework and empirical evaluation," in Proceedings of the Third IEEE International Symposium on Network Computing and Applications (NCA 2004). IEEE, 2004.
[6] R. Bhagwan, K. Tati, Y.-C. Cheng, S. Savage, and G. M. Voelker, "Total Recall: system support for automated availability management," in NSDI, 2004.
[7] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer et al., "OceanStore: an architecture for global-scale persistent storage," ACM SIGPLAN Notices, vol. 35, no. 11, 2000.
[8] Z. Wang, A. G. Dimakis, and J. Bruck, "Rebuilding for array codes in distributed storage systems," in GLOBECOM Workshops (GC Wkshps). IEEE, 2010.
[9] L. Xiang, Y. Xu, J. Lui, Q. Chang, Y. Pan, and R. Li, "A hybrid approach to failed disk recovery using RAID-6 codes: algorithms and performance evaluation," ACM Transactions on Storage (TOS), vol. 7, no. 3, 2011.
[10] J. Huang, X. Liang, X. Qin, Q. Cao, and C. Xie, "PUSH: a pipelined reconstruction I/O for erasure-coded storage clusters," IEEE Transactions on Parallel and Distributed Systems, 2014.
[11] I. S. Reed and G. Solomon, "Polynomial codes over certain finite fields," Journal of the Society for Industrial & Applied Mathematics, vol. 8, no. 2, 1960.
[12] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin, "Erasure coding in Windows Azure Storage," in Proceedings of the 2012 USENIX Annual Technical Conference (ATC '12). Boston, MA, USA: USENIX, 2012.
[13] P. Corbett, B. English, A.
Goel, T. Grcanac, S. Kleiman, J. Leong, and S. Sankar, "Row-diagonal parity for double disk failure correction," in Proceedings of the 3rd USENIX Conference on File and Storage Technologies (FAST), 2004.
[14] M. Blaum, J. Brady, J. Bruck, and J. Menon, "EVENODD: an efficient scheme for tolerating double disk failures in RAID architectures," IEEE Transactions on Computers, vol. 44, no. 2, 1995.
[15] J. Huang, X. Qin, F. Zhang, W.-S. Ku, and C. Xie, "MFTS: a multi-level fault-tolerant archiving storage with optimized maintenance bandwidth," IEEE Transactions on Dependable and Secure Computing, 2014.
[16] C. Wu, X. He, G. Wu, S. Wan, X. Liu, Q. Cao, and C. Xie, "HDP code: a horizontal-diagonal parity code to optimize I/O load balancing in RAID-6," in Proceedings of the 41st IEEE/IFIP International Conference on Dependable Systems & Networks (DSN). IEEE, 2011.
[17] L. Tian, D. Feng, H. Jiang, K. Zhou, L. Zeng, J. Chen, Z. Wang, and Z. Song, "PRO: a popularity-based multi-threaded reconstruction optimization for RAID-structured storage systems," in FAST, 2007.
[18] Y. Zhu, P. P. Lee, L. Xiang, Y. Xu, and L. Gao, "A cost-based heterogeneous recovery scheme for distributed storage systems with RAID-6 codes," in Proceedings of the 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 2012.
[19] Y. Zhu, P. P. Lee, Y. Hu, L. Xiang, and Y. Xu, "On the speedup of single-disk failure recovery in XOR-coded storage systems: theory and practice," in Proceedings of the 28th IEEE Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 2012.
[20] W. Almesberger et al., "Linux network traffic control: implementation overview."
[21] hdparm utility - get/set ATA/SATA drive parameters under Linux, open source code distribution.
[22] J. S. Plank, S. Simmerman, and C. D. Schuman, "Jerasure: a library in C/C++ facilitating erasure coding for storage applications - version 1.2," University of Tennessee, Tech. Rep. CS-08-627, 2008.


More information

COMS: Customer Oriented Migration Service

COMS: Customer Oriented Migration Service Boise State University ScholarWorks Computer Science Faculty Publications and Presentations Department of Computer Science 1-1-217 COMS: Customer Oriented Migration Service Kai Huang Boise State University

More information

Analyzing and Improving Load Balancing Algorithm of MooseFS

Analyzing and Improving Load Balancing Algorithm of MooseFS , pp. 169-176 http://dx.doi.org/10.14257/ijgdc.2014.7.4.16 Analyzing and Improving Load Balancing Algorithm of MooseFS Zhang Baojun 1, Pan Ruifang 1 and Ye Fujun 2 1. New Media Institute, Zhejiang University

More information

Parity Logging with Reserved Space: Towards Efficient Updates and Recovery in Erasure-coded Clustered Storage

Parity Logging with Reserved Space: Towards Efficient Updates and Recovery in Erasure-coded Clustered Storage Parity Logging with Reserved Space: Towards Efficient Updates and Recovery in Erasure-coded Clustered Storage Jeremy C. W. Chan, Qian Ding, Patrick P. C. Lee, and Helen H. W. Chan, The Chinese University

More information

A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop

A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop Myoungjin Kim 1, Seungho Han 1, Jongjin Jung 3, Hanku Lee 1,2,*, Okkyung Choi 2 1 Department of Internet and Multimedia Engineering,

More information

Research on Implement Snapshot of pnfs Distributed File System

Research on Implement Snapshot of pnfs Distributed File System Applied Mathematics & Information Sciences An International Journal 2011 NSP 5 (2) (2011), 179S-185S Research on Implement Snapshot of pnfs Distributed File System Liu-Chao, Zhang-Jing Wang, Liu Zhenjun,

More information

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,

More information

Correlation-aware Prefetching in Fault-tolerant Distributed Object-based File System

Correlation-aware Prefetching in Fault-tolerant Distributed Object-based File System Journal of Computational Information Systems 8: 16 (212) 1 8 Available at http://www.jofcis.com Correlation-aware Prefetching in Fault-tolerant Distributed Object-based File System Jiancong TONG, Bin ZHANG,

More information

Correlation based File Prefetching Approach for Hadoop

Correlation based File Prefetching Approach for Hadoop IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie

More information

I/O CANNOT BE IGNORED

I/O CANNOT BE IGNORED LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.

More information

RAIDX: RAID without striping

RAIDX: RAID without striping RAIDX: RAID without striping András Fekete University of New Hampshire afekete@wildcats.unh.edu Elizabeth Varki University of New Hampshire varki@cs.unh.edu Abstract Each disk of traditional RAID is logically

More information

Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce

Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce Shiori KURAZUMI, Tomoaki TSUMURA, Shoichi SAITO and Hiroshi MATSUO Nagoya Institute of Technology Gokiso, Showa, Nagoya, Aichi,

More information

Research on Design and Application of Computer Database Quality Evaluation Model

Research on Design and Application of Computer Database Quality Evaluation Model Research on Design and Application of Computer Database Quality Evaluation Model Abstract Hong Li, Hui Ge Shihezi Radio and TV University, Shihezi 832000, China Computer data quality evaluation is the

More information

Erasure Codes for Heterogeneous Networked Storage Systems

Erasure Codes for Heterogeneous Networked Storage Systems Erasure Codes for Heterogeneous Networked Storage Systems Lluís Pàmies i Juárez Lluís Pàmies i Juárez lpjuarez@ntu.edu.sg . Introduction Outline 2. Distributed Storage Allocation Problem 3. Homogeneous

More information

The material in this lecture is taken from Dynamo: Amazon s Highly Available Key-value Store, by G. DeCandia, D. Hastorun, M. Jampani, G.

The material in this lecture is taken from Dynamo: Amazon s Highly Available Key-value Store, by G. DeCandia, D. Hastorun, M. Jampani, G. The material in this lecture is taken from Dynamo: Amazon s Highly Available Key-value Store, by G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall,

More information

Network Design Considerations for Grid Computing

Network Design Considerations for Grid Computing Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom

More information

GearDB: A GC-free Key-Value Store on HM-SMR Drives with Gear Compaction

GearDB: A GC-free Key-Value Store on HM-SMR Drives with Gear Compaction GearDB: A GC-free Key-Value Store on HM-SMR Drives with Gear Compaction Ting Yao 1,2, Jiguang Wan 1, Ping Huang 2, Yiwen Zhang 1, Zhiwen Liu 1 Changsheng Xie 1, and Xubin He 2 1 Huazhong University of

More information

SYSTEM UPGRADE, INC Making Good Computers Better. System Upgrade Teaches RAID

SYSTEM UPGRADE, INC Making Good Computers Better. System Upgrade Teaches RAID System Upgrade Teaches RAID In the growing computer industry we often find it difficult to keep track of the everyday changes in technology. At System Upgrade, Inc it is our goal and mission to provide

More information

A Remote Hot Standby System of Oracle

A Remote Hot Standby System of Oracle 2012 International Conference on Image, Vision and Computing (ICIVC 2012) IPCSIT vol. 50 (2012) (2012) IACSIT Press, Singapore DOI: 10.7763/IPCSIT.2012.V50.51 A Remote Hot Standby System of Oracle Qiu

More information

A survey on regenerating codes

A survey on regenerating codes International Journal of Scientific and Research Publications, Volume 4, Issue 11, November 2014 1 A survey on regenerating codes V. Anto Vins *, S.Umamageswari **, P.Saranya ** * P.G Scholar, Department

More information

An Efficient Storage Mechanism to Distribute Disk Load in a VoD Server

An Efficient Storage Mechanism to Distribute Disk Load in a VoD Server An Efficient Storage Mechanism to Distribute Disk Load in a VoD Server D.N. Sujatha 1, K. Girish 1, K.R. Venugopal 1,andL.M.Patnaik 2 1 Department of Computer Science and Engineering University Visvesvaraya

More information

A Hybrid Scheme for Object Allocation in a Distributed Object-Storage System

A Hybrid Scheme for Object Allocation in a Distributed Object-Storage System A Hybrid Scheme for Object Allocation in a Distributed Object-Storage System Fang Wang **, Shunda Zhang, Dan Feng, Hong Jiang, Lingfang Zeng, and Song Lv Key Laboratory of Data Storage System, Ministry

More information

Remote Direct Storage Management for Exa-Scale Storage

Remote Direct Storage Management for Exa-Scale Storage , pp.15-20 http://dx.doi.org/10.14257/astl.2016.139.04 Remote Direct Storage Management for Exa-Scale Storage Dong-Oh Kim, Myung-Hoon Cha, Hong-Yeon Kim Storage System Research Team, High Performance Computing

More information

All About Erasure Codes: - Reed-Solomon Coding - LDPC Coding. James S. Plank. ICL - August 20, 2004

All About Erasure Codes: - Reed-Solomon Coding - LDPC Coding. James S. Plank. ICL - August 20, 2004 All About Erasure Codes: - Reed-Solomon Coding - LDPC Coding James S. Plank Logistical Computing and Internetworking Laboratory Department of Computer Science University of Tennessee ICL - August 2, 24

More information

An Architectural Approach to Improving the Availability of Parity-Based RAID Systems

An Architectural Approach to Improving the Availability of Parity-Based RAID Systems Computer Science and Engineering, Department of CSE Technical reports University of Nebraska - Lincoln Year 2007 An Architectural Approach to Improving the Availability of Parity-Based RAID Systems Lei

More information

Shaking Service Requests in Peer-to-Peer Video Systems

Shaking Service Requests in Peer-to-Peer Video Systems Service in Peer-to-Peer Video Systems Ying Cai Ashwin Natarajan Johnny Wong Department of Computer Science Iowa State University Ames, IA 500, U. S. A. E-mail: {yingcai, ashwin, wong@cs.iastate.edu Abstract

More information

Evaluating Auto Scalable Application on Cloud

Evaluating Auto Scalable Application on Cloud Evaluating Auto Scalable Application on Cloud Takashi Okamoto Abstract Cloud computing enables dynamic scaling out of system resources, depending on workloads and data volume. In addition to the conventional

More information

Decentralized Distributed Storage System for Big Data

Decentralized Distributed Storage System for Big Data Decentralized Distributed Storage System for Big Presenter: Wei Xie -Intensive Scalable Computing Laboratory(DISCL) Computer Science Department Texas Tech University Outline Trends in Big and Cloud Storage

More information

Statistical Performance Comparisons of Computers

Statistical Performance Comparisons of Computers Tianshi Chen 1, Yunji Chen 1, Qi Guo 1, Olivier Temam 2, Yue Wu 1, Weiwu Hu 1 1 State Key Laboratory of Computer Architecture, Institute of Computing Technology (ICT), Chinese Academy of Sciences, Beijing,

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google SOSP 03, October 19 22, 2003, New York, USA Hyeon-Gyu Lee, and Yeong-Jae Woo Memory & Storage Architecture Lab. School

More information

Google File System (GFS) and Hadoop Distributed File System (HDFS)

Google File System (GFS) and Hadoop Distributed File System (HDFS) Google File System (GFS) and Hadoop Distributed File System (HDFS) 1 Hadoop: Architectural Design Principles Linear scalability More nodes can do more work within the same time Linear on data size, linear

More information

BCStore: Bandwidth-Efficient In-memory KV-Store with Batch Coding. Shenglong Li, Quanlu Zhang, Zhi Yang and Yafei Dai Peking University

BCStore: Bandwidth-Efficient In-memory KV-Store with Batch Coding. Shenglong Li, Quanlu Zhang, Zhi Yang and Yafei Dai Peking University BCStore: Bandwidth-Efficient In-memory KV-Store with Batch Coding Shenglong Li, Quanlu Zhang, Zhi Yang and Yafei Dai Peking University Outline Introduction and Motivation Our Design System and Implementation

More information

Multi-path based Algorithms for Data Transfer in the Grid Environment

Multi-path based Algorithms for Data Transfer in the Grid Environment New Generation Computing, 28(2010)129-136 Ohmsha, Ltd. and Springer Multi-path based Algorithms for Data Transfer in the Grid Environment Muzhou XIONG 1,2, Dan CHEN 2,3, Hai JIN 1 and Song WU 1 1 School

More information

Today s Papers. Array Reliability. RAID Basics (Two optional papers) EECS 262a Advanced Topics in Computer Systems Lecture 3

Today s Papers. Array Reliability. RAID Basics (Two optional papers) EECS 262a Advanced Topics in Computer Systems Lecture 3 EECS 262a Advanced Topics in Computer Systems Lecture 3 Filesystems (Con t) September 10 th, 2012 John Kubiatowicz and Anthony D. Joseph Electrical Engineering and Computer Sciences University of California,

More information

AOS: Adaptive Out-of-order Scheduling for Write-caused Interference Reduction in Solid State Disks

AOS: Adaptive Out-of-order Scheduling for Write-caused Interference Reduction in Solid State Disks , March 18-20, 2015, Hong Kong AOS: Adaptive Out-of-order Scheduling for Write-caused Interference Reduction in Solid State Disks Pingguo Li, Fei Wu*, You Zhou, Changsheng Xie, Jiang Yu Abstract The read/write

More information

Software-defined Storage: Fast, Safe and Efficient

Software-defined Storage: Fast, Safe and Efficient Software-defined Storage: Fast, Safe and Efficient TRY NOW Thanks to Blockchain and Intel Intelligent Storage Acceleration Library Every piece of data is required to be stored somewhere. We all know about

More information

Low Complexity Opportunistic Decoder for Network Coding

Low Complexity Opportunistic Decoder for Network Coding Low Complexity Opportunistic Decoder for Network Coding Bei Yin, Michael Wu, Guohui Wang, and Joseph R. Cavallaro ECE Department, Rice University, 6100 Main St., Houston, TX 77005 Email: {by2, mbw2, wgh,

More information

Australian Journal of Basic and Applied Sciences

Australian Journal of Basic and Applied Sciences ISSN:1991-8178 Australian Journal of Basic and Applied Sciences Journal home page: www.ajbasweb.com A Review on Raid Levels Implementation and Comparisons P. Sivakumar and K. Devi Department of Computer

More information

A Joint Replication-Migration-based Routing in Delay Tolerant Networks

A Joint Replication-Migration-based Routing in Delay Tolerant Networks A Joint -Migration-based Routing in Delay Tolerant Networks Yunsheng Wang and Jie Wu Dept. of Computer and Info. Sciences Temple University Philadelphia, PA 19122 Zhen Jiang Dept. of Computer Science West

More information

Research on Availability of Virtual Machine Hot Standby based on Double Shadow Page Tables

Research on Availability of Virtual Machine Hot Standby based on Double Shadow Page Tables International Conference on Computer, Networks and Communication Engineering (ICCNCE 2013) Research on Availability of Virtual Machine Hot Standby based on Double Shadow Page Tables Zhiyun Zheng, Huiling

More information