Locally repairable codes


1 Locally repairable codes for large-scale storage. Presented by Anwitaman Datta, Nanyang Technological University, Singapore. Joint work with Frédérique Oggier & Lluís Pàmies i Juárez. IISc & NetApp, Bangalore, October 2012. 2012 A. Datta, NTU Singapore

2 Who am I? //sands ntu sg/ 2

3 S*Aspects of Networked Distributed Systems. Applications: recommendation and decision support systems; decentralized online social networking and collaboration; privacy aware/preserved data aggregation, storage, sharing & analytics/data-mining; data/computation at 3rd party/outsourced. (Distributed) Systems: distributed key-value stores; data-center design; P2P/F2F storage systems; networked distributed storage & data management systems. Foundational: social network analysis; trust models; secure/privacy preserved computation primitives; codes for storage.

4 Large-scale storage: Disclaimer. A note from the trenches: "You know you have a large storage system when you get paged at 1 AM because you only have a few petabytes of storage left." From Andrew Fikes (Principal Engineer, Google), faculty summit talk `Storage Architecture and Challenges`. And some ask/say: why do you care about efficient storage space utilization, when storage is so cheap? I never get such calls!!

5 Source: data-center-expansion-plans

6 Scale how? Scale up: to scale vertically (or scale up) means to add resources to a single node in a system.* Scale out: to scale horizontally (or scale out) means to add more nodes to a system, such as adding a new computer to a distributed software application.* (* Definitions from Wikipedia)

7 Distribution is essential. Scaling up: may not even be feasible; even if feasible, it will be very expensive; and what happens when the machine fails? Scaling out => distributed storage. Distribution => added complexity and vulnerabilities: latency, consistency, faults, CAP theorem (Consistency, Availability, Partition tolerance: choose any two?). But not distributing is not a choice!

8 Failure Is Inevitable. But failure of the system is not an option! Failure is the pillar of rivals' success. Solution: Redundancy & Distribution

9 Five Levels of Redundancy: Physical, Virtual resource, Availability zone, Region, Cloud. From:

10 Redundancy Based Fault Tolerance. Replicate data, e.g., 3 or more copies, in nodes on different racks. Can deal with switch failures. Power back-up using battery between racks (Google).

11 But At What Cost? Failure is not an option, but are the overheads acceptable? 11

12 Reducing the Overheads of Redundancy. Erasure codes: much lower storage overhead and a high level of fault-tolerance, in contrast to replication or RAID based systems. Has the potential to significantly improve the bottom line, e.g., both Google's new DFS Colossus and Microsoft's Azure now use ECs.

13 Erasure Codes (ECs). An (n,k) erasure code is a map that takes as input k blocks and outputs n blocks, thus introducing n-k blocks of redundancy. 3-way replication is a (3,1) erasure code: encoding k=1 block gives n=3 encoded blocks. An erasure code such that the k original blocks can be recreated out of any k encoded blocks is called MDS (maximum distance separable).

14 Reed-Solomon Codes (named after Irving S. Reed and Gustave Solomon). Reed-Solomon codes are well-known erasure codes. Encoding of $(o_1, \ldots, o_k)$ is done by polynomial evaluation: $p(X) = o_1 + o_2 X + \cdots + o_k X^{k-1}$. The encoded blocks are then $p(\alpha_1), \ldots, p(\alpha_n)$.
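
To make the polynomial-evaluation idea concrete, here is a minimal sketch of a toy Reed-Solomon code. Several choices are assumptions made only for illustration and are not taken from the slides: the prime field GF(257) (deployed systems typically use GF(2^m)), the evaluation points alpha_i = 1..n, the polynomial form with the original blocks as coefficients, and the function names `rs_encode` and `rs_decode`.

```python
# Toy Reed-Solomon code over the prime field GF(P).
# P = 257 is an assumption for readability; symbols must be in 0..256.
P = 257

def rs_encode(blocks, n):
    """Evaluate p(X) = o_1 + o_2 X + ... + o_k X^(k-1) at alpha = 1..n."""
    def p(x):
        acc = 0
        for coeff in reversed(blocks):  # Horner's rule, highest degree first
            acc = (acc * x + coeff) % P
        return acc
    return [(alpha, p(alpha)) for alpha in range(1, n + 1)]

def rs_decode(points, k):
    """Recover the k coefficients from any k (alpha, p(alpha)) pairs
    using Lagrange interpolation modulo P."""
    points = points[:k]
    coeffs = [0] * k
    for i, (xi, yi) in enumerate(points):
        basis = [1]   # coefficients of prod_{j != i} (X - xj), low degree first
        denom = 1     # prod_{j != i} (xi - xj)
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            nxt = [0] * (len(basis) + 1)
            for d, c in enumerate(basis):
                nxt[d] = (nxt[d] - c * xj) % P
                nxt[d + 1] = (nxt[d + 1] + c) % P
            basis = nxt
            denom = (denom * (xi - xj)) % P
        scale = yi * pow(denom, P - 2, P) % P  # modular inverse via Fermat
        for d, c in enumerate(basis):
            coeffs[d] = (coeffs[d] + scale * c) % P
    return coeffs

data = [72, 105, 33]                 # k = 3 original symbols
encoded = rs_encode(data, 7)         # n = 7 encoded symbols
recovered = rs_decode([encoded[6], encoded[2], encoded[4]], 3)
print(recovered == data)
```

Any 3 of the 7 encoded symbols determine the degree-2 polynomial, which is the MDS property in this toy setting.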

15 Erasure Codes (ECs). Originally designed for communication. EC(n,k): the data (message) is split into k original blocks O_1, O_2, ..., O_k and encoded into n encoded blocks B_1, B_2, ..., B_n. Some blocks are lost; receiving any k' (≥ k) blocks allows decoding and reconstructing the data, i.e., the original k blocks.

16 Erasure Codes for Networked Storage. The data (object) is split into k blocks O_1, O_2, ..., O_k and encoded into n encoded blocks B_1, B_2, ..., B_n (stored in storage devices in a network). Some blocks are lost; retrieving any k' (≥ k) blocks allows decoding and reconstructing the data, i.e., the original k blocks.

17 Static Resilience. Replicated r times: faults that can be tolerated: r-1; probability of failure: $f^r$; storage efficiency: 1/r; access: find any one good replica. Erasure coded (k of n): faults that can be tolerated: n-k; probability of failure: $\sum_{j=n-k+1}^{n} \binom{n}{j} f^j (1-f)^{n-j}$; storage efficiency: k/n; access: find k good blocks. Assumption: peer failure is i.i.d. with failure probability f. Example: for f=0.1, an object with 3 replicas is lost with probability $10^{-3}$; for f=0.1, a 3 of 9 code loses the object with probability ~$3 \times 10^{-6}$.
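
The two failure probabilities on this slide can be checked numerically. The sketch below assumes the i.i.d. failure model stated on the slide and computes the loss probability of an (n,k) MDS code as the probability that more than n-k blocks fail; the function names are illustrative only.

```python
from math import comb

def replication_failure(f, r):
    """Object is lost only if all r replicas fail."""
    return f ** r

def erasure_failure(f, n, k):
    """Object is lost if more than n-k of the n blocks fail,
    i.e. fewer than k blocks survive (MDS code assumed)."""
    return sum(comb(n, j) * f**j * (1 - f)**(n - j)
               for j in range(n - k + 1, n + 1))

f = 0.1
print(replication_failure(f, 3))   # 3 replicas, storage efficiency 1/3
print(erasure_failure(f, 9, 3))    # 3 of 9 code, storage efficiency 1/3
```

Both schemes store three times the object size, yet the erasure code's loss probability is smaller by more than two orders of magnitude in this model.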

18 Replenishing Lost Redundancy for ECs. Repair is needed for long term resilience. Retrieve any k' (≥ k) blocks, decode to obtain the original k blocks O_1, O_2, ..., O_k, re-encode to recreate the lost blocks, and re-insert them in (new) storage devices, so that there are (again) n encoded blocks. Repairs are expensive!

19 Can We Have Better Repairability? Erasure codes tailor-made for distributed networked storage. Localized repairs: e.g., Hierarchical & Pyramid codes. Locally repairable codes: e.g., Self-repairing codes, Punctured RM. Note: network coding inspired regenerating codes also aim for better repairability, and specifically to minimize repair bandwidth; however, our focus here is on codes that reduce the repair fan-in (reduction of bandwidth is often an additional benefit).

20 What to Tailor-Make the Codes for? Desired code properties include: low storage overhead and good fault tolerance (traditional MDS erasure codes achieve these); better repairability: smaller repair fan-in, reduced I/O for repairs, possibility of multiple simultaneous repairs, fast repairs, efficient B/W usage; better data-insertion; better migration to archival.

21 Localized repairs: Hierarchical Codes. A bottom-up approach. Essentially nested use of erasure codes. [Figure: subgroups of original blocks, each with local redundancy blocks; a code group combines subgroups and adds global redundancy blocks computed from the code groups.] Multi-hierarchical extension.

22 Hierarchical Codes. Pros: for a `small' number of faults, communication restricted within the `hierarchy' suffices; progressively go higher-up for a larger number of faults; isolated faults can be repaired independently; naturally maps to hierarchical data-center design? Cons: asymmetry (different encoded blocks have different importance); difficult to analyze; complex algorithm (for decoding/repair) and system design. Reference: Hierarchical Codes: How to Make Erasure Codes Attractive for Peer to Peer Storage Systems, A. Duminuco, E. P2P

23 Localized repairs: Pyramid Codes. A top-down approach. Example: consider an MDS (11,8) code. A (12,8) Pyramid code can be derived from this MDS code.
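
The locality idea behind going from 11 to 12 encoded blocks can be sketched as follows: one parity computed over all 8 data blocks is split into two local parities, each over a group of 4 data blocks. This sketch is an assumption-laden simplification: a plain XOR parity stands in for one of the MDS parities, the remaining global parities of the (11,8) code are not modeled, and all function names are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def pyramid_local_parities(data, group_size=4):
    """Split one XOR parity over all data blocks into one local parity per group."""
    groups = [data[i:i + group_size] for i in range(0, len(data), group_size)]
    return [xor_blocks(group) for group in groups]

def repair_from_group(survivors, local_parity):
    """Recreate the single missing block of a group from the group's
    surviving blocks and its local parity."""
    return xor_blocks(survivors + [local_parity])

data = [bytes([i] * 4) for i in range(1, 9)]   # 8 toy data blocks
global_parity = xor_blocks(data)               # parity over all 8 blocks
local = pyramid_local_parities(data)           # two parities over 4 blocks each

# The two local parities together carry the same information as the global one.
print(xor_blocks(local) == global_parity)

# Repair data[2] inside its group: 4 blocks are read instead of 8.
lost = 2
survivors = [blk for i, blk in enumerate(data[:4]) if i != lost]
repaired = repair_from_group(survivors, local[0])
print(repaired == data[lost])
```

The extra stored block buys a smaller repair fan-in for single failures inside a group, which is the trade of space for access efficiency that Pyramid codes aim at.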

24 Pyramid Codes. Good degraded read performance. Cheaper repairs. Fault-tolerance: there are regimes with deterministic behavior, and some intermediate regimes with probabilistic behavior, so it is easier to understand and reason about the system. An optimized version (called `local reconstruction code') is used in Microsoft's Azure system. Can be readily extended into a multi-tier pyramid. Reference: Pyramid Codes: Flexible Schemes to Trade Space for Access Efficiency in Reliable Data Storage Systems, Cheng Huang, Minghua Chen, and Jin NCA

25 Locally Repairable Codes. The name is reminiscent of locally decodable codes. Codes satisfying: (1) encoded fragments can be repaired directly from other small subsets (<< k) of encoded fragments; this is achievable also by codes supporting localized repairs (sometimes). The number of live nodes contacted for repair (i.e., fan-in) is in fact minimized, the fan-in being some small constant, such as 2 or 3, typically independent of the code parameters n & k. (2) A fragment can be repaired from a fixed number of encoded fragments, independently of which specific blocks are missing; this is analogous to erasure codes supporting reconstruction under any n-k losses, independently of which blocks are lost, and is partly achieved by some codes supporting localized repairs.

26 Homomorphic Self-repairing Codes (HSRC). Usual disclaimer: to the best of our knowledge, the first instance of a locally repairable code. Since then, there have been other instances, including another SRC variant we proposed (using a projective geometric construction: PSRC) and codes from other groups, e.g., punctured Reed-Muller codes. Note: k encoded blocks are enough to recreate the object. Caveat: not any arbitrary k (i.e., SRCs are not MDS); however, there are many such k combinations. Reference: Self-repairing Homomorphic Codes for Distributed Storage Systems, Frédérique Oggier and Anwitaman Datta, Infocom

27 Self-repairing Codes: Blackbox View. Of the n encoded blocks B_1, B_2, ..., B_n (stored in storage devices in a network), some are lost. Retrieve some k' (< k) blocks (e.g. k'=2) to recreate a lost block B_l, and re-insert it in (new) storage devices, so that there are (again) n encoded blocks.

28 Homomorphic Self-repairing Codes (HSRC) Preliminaries 28

29 HSRC encoding 29

30 Self-repairing Codes. The data (object) is split into k blocks O_1, O_2, ..., O_k (each of size M/k) and encoded with a linearized polynomial into n encoded blocks B_1, B_2, ..., B_n. There is at least one pair to repair a node, for up to (n-1)/2 simultaneous failures (parallel & fast repair of multiple faults).
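
The repair-by-pairs property follows from the additivity of linearized polynomials in characteristic 2: p(a + b) = p(a) + p(b). The sketch below illustrates this for n = 7 and k = 3. It rests on assumptions not recoverable from the slides: the field GF(2^4) with modulus x^4 + x + 1, the evaluation points 1..7 (the nonzero elements of a 3-dimensional GF(2)-subspace, written as integers), and hypothetical function names and object values.

```python
MOD = 0b10011  # x^4 + x + 1, defines GF(2^4) (assumed field)

def gf_mul(a, b):
    """Multiply two GF(2^4) elements given as 4-bit integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:      # reduce when the degree reaches 4
            a ^= MOD
    return result

def linearized_eval(obj, x):
    """p(x) = sum_i obj[i] * x^(2^i), a linearized polynomial."""
    acc = 0
    power = x                          # x^(2^0)
    for o in obj:
        acc ^= gf_mul(o, power)
        power = gf_mul(power, power)   # squaring gives the next x^(2^i)
    return acc

def encode(obj, points):
    return {a: linearized_eval(obj, a) for a in points}

def repair_pairs(lost, points):
    """All pairs (b, c) of evaluation points with b + c = lost."""
    return [(b, lost ^ b) for b in points
            if b != lost and b < (lost ^ b) and (lost ^ b) in points]

obj = [0b1011, 0b0110, 0b1101]     # k = 3 symbols, hypothetical values
points = list(range(1, 8))         # n = 7 evaluation points
encoded = encode(obj, points)

lost = 5
for b, c in repair_pairs(lost, points):
    # XOR of two surviving blocks recreates the lost one.
    print((b, c), encoded[b] ^ encoded[c] == encoded[lost])
```

Each block here has (n-1)/2 = 3 disjoint repair pairs, so a repair needs only two live nodes and several repairs can proceed in parallel.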

31 HSRC(7,3) example 31

32 HSRC(7,3) example 32

33 HSRC(15,4) example: fast repair. [Tables: possible pairs to repair each block; one possible parallelized repair schedule.]

34 PSRC Example. Repair using two nodes, say N_1 and N_3: (o_1+o_2+o_4) + (o_1) => o_2+o_4; (o_3) + (o_2+o_3) => o_2; (o_1) + (o_2) => o_1+o_2. Four pieces needed to regenerate two pieces. Repair using three nodes, say N_2, N_3 and N_4: (o_2) + (o_4) => o_2+o_4; (o_1+o_2+o_4) + (o_4) => o_1+o_2. Three pieces needed to regenerate two pieces. Reference: Self-repairing Codes for Distributed Storage Systems: A Projective Geometric Construction, Frédérique Oggier and Anwitaman Datta, ITW 2011.

35 PSRC Example: Reconstruction, say using N_3, N_4 and N_5. o_3 and o_4 are available directly; (o_3) + (o_1+o_3) => o_1; (o_1) + (o_4) + (o_1+o_2+o_4) => o_2.
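
The reconstruction steps on this slide use XOR only, which can be sketched directly. The piece contents below are hypothetical two-byte values, and the helper names are illustrative; only the four combinations o_3, o_4, o_1+o_3 and o_1+o_2+o_4 are taken from the slide.

```python
def xor(*pieces):
    """Bytewise XOR of any number of equally sized pieces."""
    out = bytes(len(pieces[0]))
    for piece in pieces:
        out = bytes(a ^ b for a, b in zip(out, piece))
    return out

def reconstruct(p_o3, p_o4, p_o1_o3, p_o1_o2_o4):
    """Recover (o1, o2, o3, o4) from the four retrieved pieces."""
    o1 = xor(p_o3, p_o1_o3)            # (o3) + (o1+o3) => o1
    o2 = xor(o1, p_o4, p_o1_o2_o4)     # (o1) + (o4) + (o1+o2+o4) => o2
    return o1, o2, p_o3, p_o4

# Hypothetical original pieces of the object.
o1, o2, o3, o4 = b"ab", b"cd", b"ef", b"gh"

# Pieces retrieved from the contacted nodes, as listed on the slide.
retrieved = (o3, o4, xor(o1, o3), xor(o1, o2, o4))

print(reconstruct(*retrieved) == (o1, o2, o3, o4))
```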

36 Maximum Distance Separable (MDS)? SRC is not MDS (and cannot be!). Does it matter? Not much: in practice, access will be planned. [Figure: PSRC(21,3); this is with random access.]

37 Practical properties. (Current) SRCs are not systematic. PSRC is like systematic: need to contact more nodes (than k) to obtain systematic `pieces'; same total bandwidth usage; parallel download for access can even be an `advantage'; `mixed' strategies for access, i.e. get some systematic pieces, and some others. Power saving (by switching off nodes) strategies possible? Coding/decoding in PSRC both use XOR operations only.

38 Some very recent stuff. Data insertion: in-network coding for opportunistic back-up (arxiv: ), based on HSRC, exploiting the dependency of encoded pieces; generalization of the idea (for arbitrary LRCs) is outstanding. Migrating replicated data into erasure encoded archived storage: RapidRAID (arxiv: ) has some local repairability properties, but that aspect is yet to be explored. Another code (ICDCN 2013): a systematic code (unlike RapidRAID), found using numerical methods; a general theory for the construction of such codes, as well as their repairability properties, are open issues.

39 The data insertion problem. For erasure coded data storage, insertion is traditionally centralized: the source node needs to encode and store at all storage nodes. In contrast, replication is amenable to pipelining. Further, source and recipients may not be available at the same time. Two issues: storage nodes are busy with other tasks; in P2P/F2F settings, recipient nodes are offline. Approach: partly decentralizing the encoding process, leveraging SRC's local repairability property. Determining a good scheduling mechanism based on nodes' availability is computationally prohibitive, even with global (and future) knowledge, hence scheduling heuristics. Summary of results (simulations, so take with a grain of salt): up to 90% speed-up in storage throughput with Google data center availability and workload traces; up to 60% speed-up with a F2F system's availability trace.

40 The data migration to archival problem. Data is often not accessed after a little while in the system, and thus can be archived using a storage-efficient erasure code. Speeding up this conversion to archival: decentralizing the encoding process; exploiting the existing replication of the object. RapidRAID: summary of results (on a proprietary 50 node cluster of HP ThinClients, and on EC2 instances): up to 90% reduction of coding time of a single object; up to 20% reduction of coding time for batch processes. Implementation (by Lluís Pàmies i Juárez) is available at:

41 Future steps/wishlist. How to adapt different data management and analytics techniques to be compatible with erasure encoded data? Marrying deduplication techniques with erasure coding techniques. Support for mutable content. MTTDL analysis for codes with better repairability. A complete working system that can be used out of the box by end users (HDFS compatible).

42 Outlook. Interested to follow: sce ntu edu sg/codingfornetworkedstorage/ Two surveys on storage codes: one short, at a high level; another longer, with technical details. Get involved:

