Storage vs Repair Bandwidth for Network Erasure Coding in Distributed Storage Systems


1 Swati Mittal Singal, 2 Nitin Rakesh, MIEEE, MACM, MSIAM, LMCSI, MIAENG, 3 Rakesh Matam, MIEEE, MACM
1, 2 Department of Computer Science and Engineering, ASET, Amity University Uttar Pradesh, Noida, India
3 Department of Computer Science and Engineering, Indian Institute of Information Technology (IIIT), Guwahati, India
1 ssingal@amity.edu, 2 {nrakesh@amity.edu, nitin.rakesh@gmail.com, nitin.rakesh@ieee.org}, 3 rakesh@iiitg.ac.in

Abstract
Network coding is used in peer-to-peer storage systems, archival storage, wireless networks, satellite communication, video conferencing and similar applications. A storage system stores data at different locations. For the data to be available, durable and reliable, the system must be able to recover from failures efficiently. This paper examines and evaluates the approaches applied in such storage systems. Keeping replicas of the data at multiple places is the traditional technique used by major storage systems. To reduce the amount of storage required by replication, distributed systems are now transitioning towards erasure codes. Approaches such as hybrid and regenerating codes address both storage and repair bandwidth, but the communication cost in the face of failures still needs improvement. These approaches and their main application areas are examined and analyzed, and a comparative analysis based on storage requirement, disk access, repair bandwidth and unavailability probability is presented.

Keywords: Distributed Storage System, Erasure Codes, Regenerating Codes

I. INTRODUCTION
Network error correction, or recovery from failures, has been of interest for decades. The traditional approach of error-correction codes is a link-by-link error correction technique [1]. It controls errors by adding redundancy in the time domain: when a failure occurs during transmission, the sender retransmits the packets, which makes the process time consuming. Hence, network error correction codes were introduced as a generalization of the classical error-correction codes. A network error correction (NEC) code controls errors by introducing redundancy in the space domain [2, 3]. A data object of size M is broken into k fragments and r redundant units are added to it in such a manner that any k of the k + r data units are enough to recover from failure. This makes the system resilient to r data unit failures.

Network error correction codes are applicable in a variety of fields, including network communication, satellite communication, video conferencing, peer-to-peer file sharing systems, distributed storage, wireless sensor networks, data grids and archival storage [5-7]. With the emerging trends of cloud and big data, digital data is growing very fast and is expected to grow tenfold in seven years. This demands efficient storage and recovery in data centers. Hence large-scale Distributed Storage Systems (DSS) are currently transitioning to erasure codes.

This paper describes the approaches used in distributed storage systems and compares them by application area and by parameters such as storage required, disk access, repair bandwidth and unavailability probability. An analysis of these approaches and of the problems of distributed storage systems is also presented. Distributed storage systems have used replication to make data more available, reliable and durable; this paper examines various replication methods and compares them on their usability for applications with different storage requirements.

The paper is organized as follows. Section I introduces the distributed storage approaches and their complications in terms of performance. Section II describes the different models used for distributed storage systems. Section III compares these approaches based on storage overhead, disk access for repair/regeneration and reconstruction, cost of communication and unavailability probability. Section IV presents a quantitative analysis of these approaches. Section V concludes the findings of the comparison and analysis.

II. MODELS
In a distributed storage system, the data is divided into fragments and distributed over several nodes. This data must be communicated to another node in the network in two situations: first, when a user requests particular data, the nodes that hold it participate in the communication; second, when a node fails, a new node takes its place and recovers the data by communicating with the surviving nodes. Depending on the application environment, different approaches are used to provide efficient storage, lower repair bandwidth and more reliability. The approaches used in distributed storage systems are described below.

A. Replication
One method to recover from failure is to maintain a full replica of the data object that can be used at the time of

recovery [8]. The replication technique is used by Amazon Dynamo, the Google File System, Cassandra at Facebook and other file storage systems to provide better data availability and durability. Several copies of the same data are kept at different nodes. These copies can serve multiple clients concurrently. If any node fails, its data can be recovered from a replica maintained at another node. This redundant data may lead to inconsistencies and also involves a large storage overhead.

The main question is how many replicas are enough for the system so that no data is lost. Experiments on the PlanetLab trace showed that data is lost only when fewer than three replicas are kept. The reason is that if the rate of creating new replicas is only slightly greater than the average failure rate, the data can be lost before recovery completes [9]. Thus most applications maintain three replicas. The main advantage of replication is that no encoding or decoding is involved and the system design is easy.

If a file of size $M$ bytes maintains $R$ replicas, it stores a total of $MR$ bytes, with $M$ bytes of storage per node [6, 11]. Only one of the $R$ replicas is needed to recover the data in the face of failure. The file is unavailable if no replica is available. The unavailability probability is $(1-a)^R$, where $a$ is the mean node availability [6, 10].

Two types of failure may occur in the system: permanent failure (data is lost due to disk failure or permanent departure of the node) and transient failure (for example, a temporary network problem). The DHash algorithm responds to both permanent and transient failures to provide 100% availability, so bandwidth is wasted when the failure is only temporary. Carbonite is an efficient replication algorithm that ignores transient failures [9].

B. Erasure Codes
The major drawback of the replication strategy is its large storage overhead, which is three times the original data. Erasure codes reduce this storage requirement to about 1.4 times the original data. A data object of size $M$ bytes is divided into $k$ equal fragments and encoded into $n$ fragments, which are stored separately, where $n > k$. Such a code can correct erasures (errors whose location is known). Erasure coding adds encoding, decoding and update complexity to the system. OceanStore, Cleversafe, Facebook HDFS-RAID, HP and IBM use erasure codes such as X-codes, STAR codes, EVENODD, RDP and Reed-Solomon codes [12-16], to mention a few. EVENODD, RDP and X-codes are all RAID-6 codes. The encoding/decoding performance of RDP is better than that of the other codes. Network erasure codes are applied in areas such as digital file distribution and peer-to-peer file sharing (Avalanche from Microsoft), distributed storage, wireless mesh networks, ad hoc sensor networks, satellite communication, video conferencing, disk array systems and archival storage [5].

An erasure code stores $M/k$ bytes of data in each node, so the total storage becomes $nM/k$ bytes. The rate of the code is $k/n$. The unavailability probability of an erasure code (the probability that fewer than $k$ nodes are available) is given by $\sum_{i=0}^{k-1} \binom{n}{i} a^i (1-a)^{n-i}$ [6, 11]. The repair bandwidth of erasure codes is $M$, which is not optimal: the newcomer node that replaces the failed node has to connect to any $k$ surviving nodes and download the entire message of $M$ bytes ($M/k$ bytes from each node) just to recover its own $M/k$ bytes [10].

C. Hybrid Strategy
In addition to the $(n, k)$ erasure-coded fragments, the hybrid strategy keeps one full replica, which adds redundancy to the original [6, 11]. Using two types of redundancy adds complexity to the system design. The storage required is $M + nM/k$ bytes. At the time of failure, only a single node generates a new fragment of size $M/k$ bytes and sends it to the newcomer, so only $M/k$ bytes are transferred [10].
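The storage and unavailability expressions for replication and erasure codes can be checked numerically. The following is a minimal Python sketch, not taken from the paper: it assumes that nodes fail independently with mean availability a, takes the replication unavailability as (1 - a)^R and the erasure-code unavailability as the probability that fewer than k of n nodes are reachable, and uses function names and a (14, 10) example configuration that are illustrative choices only.

```python
from math import comb


def replication_storage(m: float, replicas: int) -> float:
    """Total storage when an object of size m keeps `replicas` full copies."""
    return m * replicas


def erasure_storage(m: float, n: int, k: int) -> float:
    """Total storage of an (n, k) code: n fragments of size m / k each."""
    return n * m / k


def replication_unavailability(a: float, replicas: int) -> float:
    """Probability that no replica is reachable: (1 - a)^R."""
    return (1.0 - a) ** replicas


def erasure_unavailability(a: float, n: int, k: int) -> float:
    """Probability that fewer than k of the n fragment nodes are reachable."""
    return sum(comb(n, i) * a**i * (1.0 - a) ** (n - i) for i in range(k))


if __name__ == "__main__":
    m, a = 2.0, 0.9  # 2 MB object; a = 0.9 is an assumed node availability
    print("3 replicas, storage:", replication_storage(m, 3))
    print("(14, 10) code, storage:", erasure_storage(m, 14, 10))
    print("3 replicas, unavailability:", replication_unavailability(a, 3))
    print("(14, 10) code, unavailability:", erasure_unavailability(a, 14, 10))
```

The (14, 10) configuration was picked only because its n/k ratio equals the 1.4 storage factor quoted for erasure codes; any other pair with n > k can be substituted.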
Compared with the replication and erasure codes discussed above, the repair bandwidth of the hybrid approach is $k$ times lower. The data is unavailable if the replica is unavailable and fewer than $k$ erasure-coded nodes are available. With $a$ denoting the mean node availability, the unavailability probability of the hybrid approach is $(1-a)\sum_{i=0}^{k-1} \binom{n}{i} a^i (1-a)^{n-i}$. In some texts a hybrid approach may refer to a combination of two erasure codes, such as RDP and EVENODD. This paper considers hybrid to be a combination of replication and erasure codes. In a high-churn environment, in which nodes join and leave the system at a high rate, the bandwidth cost is too high. Under low churn the lack of bandwidth is unimportant, but under moderate churn the hybrid approach is beneficial [6, 11].

D. Regenerating Codes
Erasure codes and the hybrid approach provide a better solution for storage and repair bandwidth, but they are not optimal. In a distributed storage environment nodes are replaced regularly as they fail, so there is a need for codes that can regenerate a lost fragment while communicating as little data as possible across the network. In regenerating codes (RC) the data can be regenerated from the surviving disks with minimum communication cost. Regenerating codes satisfy the MDS condition, so the minimum number of disks required for repair is $k$, and it can be at most $n-1$. If fewer than $k$ nodes are available, repair is not possible. The unavailability probability of RC codes is thus the same as that of erasure codes [6]. RC codes use slightly larger fragments than MDS codes but can reduce the overall bandwidth by 25% compared with the hybrid approach, and they also simplify the system design [10]. The two important operations of RC codes are reconstruction (connect to $k$ nodes to obtain the data of size $M$ bytes) and regeneration (connect to $d$ surviving nodes to recover the data of a failed node). In [6] the feasible storage-repair bandwidth curve was plotted for RC codes with $(k, n, d)$ values (5, 10, 9) and (10, 15, 14).
The curve has two special points: minimum storage and minimum bandwidth. The codes that attain the minimum storage are known as Minimum Storage Regenerating (MSR) codes, and the codes that attain the lowest repair overhead are called Minimum Bandwidth Regenerating (MBR) codes. For MSR the storage per node is $\alpha = M/k$. At the time of failure the newcomer, unlike with erasure codes, connects to $d$ nodes, where $d$ should be at least $k$ and less than

$n$. The repair bandwidth of MSR is given by $\gamma = \frac{Md}{k(d-k+1)}$. If the newcomer connects to $d = k$ surviving nodes, the repair bandwidth is $\gamma = M$, and if it connects to $d = n-1$ surviving nodes, the repair bandwidth becomes $\gamma = \frac{M(n-1)}{k(n-k)}$. Thus the cost of repair communication is minimized for $d = n-1$. The MBR point, at which the minimum bandwidth is obtained, has storage $\alpha = \frac{2Md}{2kd-k^2+k}$ and repair bandwidth $\gamma = \frac{2Md}{2kd-k^2+k}$. Note that $\alpha = \gamma$ for minimum bandwidth: the code behaves like a replication system, communicating exactly the same amount of data as is stored by the node. In case of failure, if $d = k$ the storage and bandwidth values become $\alpha = \gamma = \frac{2M}{k+1}$, and if $d = n-1$ they are $\alpha = \gamma = \frac{2M(n-1)}{2k(n-1)-k^2+k}$ [6]. Table I gives the applications where each of the above approaches is used.

TABLE I. APPLICABILITY
Approach: Application [17-30]
Replication: Google File System, Cassandra at Facebook, Amazon Dynamo, Sprite file system, Farsite file system, Coda, Bayou database system, Myriad, Locus, TotalRecall, Harp file system.
Erasure codes: Glacier, CleverSafe, HP, IBM, HDFS-RAID, wireless mesh network, Windows Azure System, digital file distribution and peer-to-peer file sharing (Avalanche from Microsoft), ad hoc sensor network, satellite communication, video conferencing, disk array systems and archival storage.
Hybrid: DHT, OceanStore, P2P systems like PAST, Farsite.
RC codes: NCCloud, P2P backup systems, archival storage.

III. COMPARISON OF DIFFERENT NETWORK CODING TECHNIQUES USED IN DSS
This paper presents a comparison of the four network coding approaches used in DSS that were discussed in the previous section. The comparison of replication, erasure codes, hybrid, MSR and MBR, the last for both $d = k$ and $d = n-1$, is given in Table II. Based on this comparison, graphs have been plotted for all the approaches. The x-axis shows the increasing $(n, k)$ configuration, for which the value of $n - k$ is fixed to 4, and the y-axis shows the parameters on which these approaches have been compared.
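The two extreme points of the regenerating-code tradeoff can be evaluated directly. The sketch below is an illustration rather than the authors' code: it uses the closed-form MSR and MBR expressions from Dimakis et al. [6], alpha = M/k with gamma = Md/(k(d-k+1)) for MSR and alpha = gamma = 2Md/(2kd - k^2 + k) for MBR, and the function names, the normalized object size M = 1 and the (n, k) = (10, 5) example are assumptions made for the demonstration.

```python
from fractions import Fraction


def _check(n: int, k: int, d: int) -> None:
    """A newcomer contacts d helpers with k <= d <= n - 1."""
    if not (0 < k <= d <= n - 1):
        raise ValueError("need 0 < k <= d <= n - 1")


def msr_point(m: int, n: int, k: int, d: int):
    """Minimum-storage point: alpha = M/k, gamma = M*d / (k*(d - k + 1))."""
    _check(n, k, d)
    alpha = Fraction(m, k)
    gamma = Fraction(m * d, k * (d - k + 1))
    return alpha, gamma


def mbr_point(m: int, n: int, k: int, d: int):
    """Minimum-bandwidth point: alpha = gamma = 2*M*d / (2*k*d - k^2 + k)."""
    _check(n, k, d)
    gamma = Fraction(2 * m * d, 2 * k * d - k * k + k)
    return gamma, gamma


if __name__ == "__main__":
    m, n, k = 1, 10, 5  # normalized object size and an assumed (n, k)
    for d in (k, n - 1):
        print("d =", d, "MSR:", msr_point(m, n, k, d), "MBR:", mbr_point(m, n, k, d))
```

Exact fractions are used so that the special cases are visible without rounding: with d = k the MSR repair bandwidth equals the whole object, and at the MBR point the stored and transferred amounts coincide for every d.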
Figure 1 depicts the comparison parameters for a data object of size $M$ bytes divided into $k$ fragments and distributed over $n$ nodes ($n > k$). To reconstruct the data, the Data Collector (DC) connects to any $k$ disks, and to repair a failed node, the newcomer that takes its place connects to $d$ surviving nodes ($k \le d \le n-1$) to recover the data. The parameters of comparison are described below in detail.

A. Storage Requirements
The storage requirement is the total space (in bytes) needed to store the data object so that it is available at all times. In a distributed storage system, if a data object of size $M$, divided into $k$ fragments, is distributed over $n$ nodes, the total storage required to store the original data object varies with the approach used. The graph in Figure 2 shows the storage requirement for a data object of size $M$ divided into $k$ fragments and then distributed over $n$ disks. In the case of replication (3 replicas), the storage required to store 2 MB does not depend on $n$ and remains 6 MB for all configurations. For all other approaches, the storage requirement decreases as the $(n, k)$ configuration grows. Erasure codes and MSR codes require the same storage, and their values are optimal compared with all the other approaches.

B. Disk access for reconstruction
When the data is required by the user, the Data Collector connects to other nodes to recreate the data that is distributed over several nodes. The number of nodes required to recreate the data is called the disk access for reconstruction. For an MDS code the Data Collector connects to any $k$ nodes to reconstruct the data distributed over $n$ nodes. In the case of replication and hybrid codes, the Data Collector connects to a single node where the data or its replica is placed. The graph in Figure 3 shows that the number of disk accesses grows with the configuration. The value is the same for erasure codes, MSR and MBR, namely $k$ disks.

C. Disk access for repair or regeneration
At the time of failure, the newcomer node takes the place of the failed node.
By connecting to $d$ surviving (working) nodes, the lost data of the failed node is repaired. The number of disks that must be accessed to regenerate the new node varies between approaches. The graph for these approaches is in Figure 4. For replication and the hybrid approach, the number of disk accesses is 1 for all configurations. Disk access for erasure codes and for MBR codes with $d = k$ follows the MDS property, so the value is $k$ disks. For MSR and MBR with $d = n-1$, the number of disk accesses depends on $n$.

D. Repair Bandwidth
Repair bandwidth is the cost of communication at the time of failure. It depends on the number of nodes the newcomer connects to and on the number of bytes transferred by these nodes. Figure 5 shows the repair bandwidth curve for all approaches with increasing $(n, k)$ configuration, when the size of the data object is M = 2 MB. In the case of replication and erasure codes, the repair bandwidth is equal to the size of the data object, i.e. 2 MB. This value is not optimal, and the other approaches reduce the repair bandwidth. MBR has minimum repair bandwidth when the newcomer connects to $n-1$ surviving nodes. In this technique the repair bandwidth is a decreasing function of $d$: the amount of data communicated reduces with increasing $d$, which reduces the cost of communication for repair. It is clear from the figure that for hybrid and regenerating codes (MSR, MBR) the repair bandwidth also decreases as the $(n, k)$ configuration increases. The reason is that when a data object of size $M$ is stored over a larger number of disks, the amount of storage per disk reduces, which decreases the bytes transferred at the time of failure. The repair bandwidth is minimum for the hybrid approach, for which it is $M/k$. If the replica of

data is available, the loss is recovered by transferring only the part of the data that was stored in the failed node.

E. Unavailability Probability
In a distributed system, the data is unavailable if fewer than the required number of nodes are available. In the case of replication, data is unavailable if no replica is available. If $a$ is the probability of a node being available, then $(1-a)^R$ is the probability that none of the $R$ replicas is available. For all MDS codes the minimum number of required nodes is $k$. Hybrid combines the results of replication and MDS codes (i.e. data is unavailable if fewer than $k$ nodes are available and no replica is available).

Fig. 1. Coding Structure: The data object of size $M$ bytes is divided into $k$ fragments and encoded over $n$ nodes. (I) Storage required: if each node stores $M/k$ bytes of data, then the total storage in $n$ nodes is $nM/k$. (II) Disk access for reconstruction: in an MDS code the minimum number of nodes required to reconstruct the data is $k$. (III) Disk access for repair or regeneration: during failure of any node, the newcomer node that takes its place connects to $d$ surviving nodes, where $k \le d \le n-1$. (IV) Repair bandwidth: the cost of communicating data from $d$ surviving nodes to the newcomer node in order to recover the lost data. (V) Unavailability probability: the probability of fewer than the required number of nodes being available.

TABLE II. COMPARISON TABLE
Approaches (columns): Replication, (n, k) Erasure codes, Hybrid, MSR codes, MBR codes for $d = k$, MBR codes for $d = n-1$.
Parameters (rows): Storage required, Disk access for reconstruction, Disk access for repair or regeneration, Repair bandwidth, Unavailability probability.
[table entries not recoverable]

IV. QUANTITATIVE ANALYSIS
Consider the values of $n$ and $k$ in the $(n, k)$ configuration.
Table III gives the values of storage required, disk access for reconstruction, disk access for regeneration and repair bandwidth for the replication, erasure codes, hybrid, MSR, MBR ($d = k$) and MBR ($d = n-1$) approaches discussed in this paper. The values are calculated using the formulas in Table II. MSR and erasure codes have the same storage requirement, but the repair bandwidth of MSR is 73% less than that of erasure codes, at the cost of only 3 more disk accesses. The storage requirement for hybrid, MBR ($d = k$) and MBR ($d = n-1$) is almost the same, with the hybrid approach having the least value of 4.173 MB to store 2 MB of data. The repair bandwidth of the hybrid and MBR approaches is also much lower than that of the other approaches. The hybrid approach has about 46% less repair bandwidth than both MBR approaches. In spite of its low storage and repair bandwidth, the hybrid approach is not widely used because of its complex system design and the need to maintain two different kinds of error correction techniques.

TABLE III. QUANTITATIVE COMPARISON
Columns: Storage requirement (in MB), Disk access for reconstruction, Disk access for repair or regeneration, Repair bandwidth.
Rows: Replication, Erasure codes, Hybrid, MSR, MBR ($d = k$), MBR ($d = n-1$).
[table entries not recoverable]
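The figures quoted in this section can be approximated with a short calculation. The sketch below is a hedged reconstruction, not the authors' computation: the configuration n = 50, k = 46 is an assumption (it is not stated in the surviving text), chosen because with M = 2 MB it gives roughly 4.17 MB of hybrid storage, a 73% repair-bandwidth saving for MSR, three extra disk accesses and a 46% saving for hybrid over MBR with d = n - 1. The hybrid storage M + nM/k and the MSR/MBR expressions from [6] are likewise assumed, and `compare` is a hypothetical helper name.

```python
def compare(m: float, n: int, k: int, replicas: int = 3) -> dict:
    """Map approach -> (total storage, reconstruction disks, repair disks, repair bandwidth)."""

    def msr_gamma(d: int) -> float:
        # MSR repair bandwidth when the newcomer contacts d helpers.
        return m * d / (k * (d - k + 1))

    def mbr_gamma(d: int) -> float:
        # MBR repair bandwidth; also the per-node storage at the MBR point.
        return 2 * m * d / (2 * k * d - k * k + k)

    return {
        "replication": (m * replicas, 1, 1, m),
        "erasure": (n * m / k, k, k, m),
        "hybrid": (m + n * m / k, 1, 1, m / k),
        "MSR (d=n-1)": (n * m / k, k, n - 1, msr_gamma(n - 1)),
        "MBR (d=k)": (n * mbr_gamma(k), k, k, mbr_gamma(k)),
        "MBR (d=n-1)": (n * mbr_gamma(n - 1), k, n - 1, mbr_gamma(n - 1)),
    }


if __name__ == "__main__":
    table = compare(2.0, 50, 46)  # M = 2 MB; (n, k) = (50, 46) is an assumption
    for name, (storage, recon, repair, bandwidth) in table.items():
        print(f"{name:12s} storage={storage:7.3f} MB  recon={recon:2d}  "
              f"repair={repair:2d}  bandwidth={bandwidth:6.4f} MB")
    saving = 1 - table["MSR (d=n-1)"][3] / table["erasure"][3]
    print(f"MSR repair-bandwidth saving over erasure codes: {saving:.1%}")
```

Other entries, such as the MBR storage totals, depend on formula details that are not recoverable here and are not claimed to match the original table.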

V. CONCLUSION AND FUTURE WORK
With the increase in the amount of data over the years, there is a need to reduce the storage requirement without affecting the availability and reliability of data. Erasure codes provide a simple solution for reducing storage. Different erasure codes have been designed to reduce encoding, decoding and update complexity and to tolerate bursts of up to three or four failures, but they bring no significant improvement in repair bandwidth, as also shown in this paper. Our analysis shows that the hybrid approach, which uses replication along with erasure codes, performs better in terms of repair bandwidth. Its main drawback is the complex system architecture that results from maintaining two different techniques. As future work, combining an optimal erasure technique with replication may provide better results than the other approaches.

Fig. 2. Storage Requirement
Fig. 3. Disk access for Reconstruction
Fig. 4. Disk access for Repair/Regeneration
Fig. 5. Repair Bandwidth

REFERENCES
[1] N. Cai and R. W. Yeung, Network coding and error correction, in Proc. IEEE Inf. Theory Workshop, Bangalore, India, Oct. 2002.
[2] Z. Zhang, Theory and applications of network error correction coding, Proc. IEEE, vol. 99, no. 3, March.
[3] Z. Zhang, Linear Network Error Correction Codes in Packet Networks, IEEE Transactions on Information Theory, vol. 54, no. 1, Jan.
[4] R. Ahlswede, N. Cai, S.-Y. R. Li, and R. W. Yeung, Network information flow, IEEE Trans. Information Theory, 46(4), July 2000.
[5] J. S. Plank, Erasure codes for storage applications, Tutorial, FAST-2005: 4th USENIX Conference on File and Storage Technologies, San Francisco, CA, December. [online] plank/plank/papers/fast-2005.html.
[6] A.G.
Dimakis, P. B. Godfrey, Y. Wu, M. J. Wainwright, and K. Ramchandran, Network coding for distributed storage systems, IEEE Trans. on Information Theory, 56(9), Sept. 2010.

[7] J. Araujo, F. Giroire, and J. Monteiro, Hybrid Approaches for Distributed Storage Systems, in Proceedings of the Fourth International Conference on Data Management in Grid and P2P Systems (Globe'11), Toulouse, France, September.
[8] H. Weatherspoon and J. D. Kubiatowicz, Erasure coding vs. replication: a quantitative comparison, in Proc. IPTPS, Mar.
[9] Byung-Gon Chun, Frank Dabek, Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, M. Frans Kaashoek, John Kubiatowicz, and Robert Morris, Efficient replica maintenance for distributed storage systems, in NSDI.
[10] R. Rodrigues and B. Liskov, High availability in DHTs: Erasure coding vs. replication, in Proc. IPTPS.
[11] A. G. Dimakis, P. B. Godfrey, M. J. Wainwright, and K. Ramchandran, The benefits of network coding for peer-to-peer storage systems, in Third Workshop on Network Coding, Theory, and Applications.
[12] I. S. Reed and G. Solomon, Polynomial Codes over Certain Finite Fields, J. SIAM, vol. 8, no. 2, 1960.
[13] M. Blaum, J. Brady, J. Bruck, J. Menon, and A. Vardy, The EVENODD code and its generalization, in High Performance Mass Storage and Parallel I/O, John Wiley & Sons, Inc.
[14] C. Huang and L. Xu, STAR: An efficient coding scheme for correcting triple storage node failures, IEEE Transactions on Computers, vol. 57, no. 7.
[15] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong and S. Sankar, Row-Diagonal Parity for Double Disk Failure Correction, Proc. of USENIX FAST 2004, Mar. 31 to Apr. 2, San Francisco, CA, USA.
[16] Lihao Xu, X-Code: MDS Array Codes with Optimal Encoding, IEEE Transactions on Information Theory, 45(1), January.
[17] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao, OceanStore: An Architecture for Global-scale Persistent Storage, in ASPLOS '00: Proc. of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, December.
[18] S.
Ghemawat, H. Gobioff and S. Leung, The Google File System, in Proc. of the 19th ACM Symposium on Operating System Principles (Oct. 2003).
[19] A. Adya, W. J. Bolosky, M. Castro, G. Cermak, R. Chaiken, J. R. Douceur, J. Howell, J. R. Lorch, M. Theimer, and R. P. Wattenhofer, FARSITE: Federated, available, and reliable storage for an incompletely trusted environment, in Proc. OSDI, Boston, MA, Dec.
[20] F. Chang, M. Ji, S.-T. Leung, J. MacCormick, S. Perl, and L. Zhang, Myriad: Cost-effective disaster tolerance, in FAST '02: Proceedings of the 1st USENIX Conference on File and Storage Technologies, page 8, Berkeley, CA, USA.
[21] B. Walker, G. Popek, R. English, C. Kline, and G. Thiel, The LOCUS distributed operating system, Proceedings of the Ninth Symposium on Operating Systems Principles, Bretton Woods, New Hampshire, October 1983.
[22] R. Bhagwan, K. Tati, Y. Cheng, C. Y, S. Savage and G. M. Voelker, Total Recall: System support for automated availability management, in Proc. of the 1st Symposium on Networked Systems Design and Implementation (Mar. 2004).
[23] B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, L. Shrira and M. Williams, Replication in the Harp File System, in Proc. of the 13th ACM Symposium on Operating System Principles (Oct. 1991).
[24] A. Haeberlen, A. Mislove and P. Druschel, Glacier: Highly durable, decentralized storage despite massive correlated failures, in Proc. of the 2nd Symposium on Networked Systems Design and Implementation (May 2005).
[25] Y. Hu, H. C. H. Chen, P. P. C. Lee, and Y. Tang, NCCloud: Applying network coding for the storage repair in a cloud-of-clouds, in Proc. of the 10th USENIX Conf. on File and Storage Tech. (FAST '12), San Jose, Feb.
[26] Nitin Rakesh and Vipin Tyagi, Linear-code multicast on parallel architectures, Elsevier Advances in Engineering Software, vol.
42.
[27] Nitin Rakesh and Vipin Tyagi, Efficient Broadcasting in Parallel Networks Using Network Coding, in Proceedings of the First International Conference on Parallel, Distributed Computing Technologies and Applications, CCIS 203, Springer-Verlag.
[28] Nitin Rakesh and Nitin, Analysis of All to All Broadcast on Multi Mesh of Trees Using Genetic Algorithm, International Workshop on Advances in Computer Networks, VLSI, ANVIT, St. Petersburg, Russia.
[29] Nitin Rakesh and Nitin, Analysis of Multi-Sort Algorithm on Multi-Mesh of Trees (MMT) Architecture, Springer Journal of Supercomputing, vol. 57, no. 3.
[30] Nitin Rakesh and Vipin Tyagi, Linear Network Coding on Multi-Mesh of Trees using All to All Broadcast, International Journal of Computer Science Issues, vol. 8, no. 3, 2011.
