Repair Pipelining for Erasure-Coded Storage

Runhui Li, Xiaolu Li, Patrick P. C. Lee, Qun Huang
The Chinese University of Hong Kong
USENIX ATC 2017

Introduction
Fault tolerance for distributed storage is critical.
Availability: data remains accessible under failures.
Durability: no data loss even under failures.
Erasure coding is a promising redundancy technique:
Minimum data redundancy via data encoding.
Higher reliability than replication at the same storage redundancy.
Reportedly deployed in Google, Azure, and Facebook; e.g., Azure reduces redundancy from 3x (replication) to 1.33x (erasure coding), saving PBs of storage.

Erasure Coding
Divide file data into k blocks.
Encode the k (uncoded) blocks into n coded blocks.
Distribute the set of n coded blocks (a stripe) to n nodes.
Fault tolerance: any k out of the n blocks can recover the file data.
[Figure: with (n, k) = (4, 2), a file is divided into pieces A, B, C, D, and the four nodes store {A, B}, {C, D}, {A+C, B+D}, and {A+D, B+C+D}; the contents of any two nodes recover the file.]
Remark: for systematic codes, k of the n coded blocks are the original k uncoded blocks.
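To make the stripe layout concrete, here is a minimal Python sketch of the (4, 2) example above, treating each piece as a byte string and "+" as bytewise XOR; the function and variable names are illustrative and not part of any described system.

```python
# Minimal sketch of the (n, k) = (4, 2) example: pieces A, B, C, D are byte
# strings, "+" is bytewise XOR, and each of the n = 4 nodes stores two pieces.
def xor(*pieces: bytes) -> bytes:
    out = bytearray(pieces[0])
    for p in pieces[1:]:
        for i, byte in enumerate(p):
            out[i] ^= byte
    return bytes(out)

def encode_4_2(A: bytes, B: bytes, C: bytes, D: bytes):
    """Return the contents of the four nodes (one stripe)."""
    return [
        (A, B),                      # node 1 (uncoded, systematic)
        (C, D),                      # node 2 (uncoded, systematic)
        (xor(A, C), xor(B, D)),      # node 3 (coded)
        (xor(A, D), xor(B, C, D)),   # node 4 (coded)
    ]

# Fault tolerance: any k = 2 nodes recover the file, e.g. nodes 1 and 3.
A, B, C, D = b"aa", b"bb", b"cc", b"dd"
stripe = encode_4_2(A, B, C, D)
(a, b), (ac, bd) = stripe[0], stripe[2]
assert (a, b, xor(ac, a), xor(bd, b)) == (A, B, C, D)
```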

Erasure Coding
Practical erasure codes satisfy linearity and addition associativity.
Each block can be expressed as a linear combination of any k blocks in the same stripe, based on Galois Field arithmetic; e.g., block B = a_1 B_1 + a_2 B_2 + a_3 B_3 + a_4 B_4 for k = 4, coefficients a_i, and blocks B_i.
This also applies to XOR-based erasure codes.
Examples: Reed-Solomon codes, regenerating codes, LRC.
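As an illustration of this linearity, the sketch below combines blocks byte by byte over GF(2^8) using the reduction polynomial 0x11d that Reed-Solomon implementations commonly use; the coefficients and block contents are made-up values, not ones produced by any particular code.

```python
# Sketch of the linearity property: B = a_1 B_1 + a_2 B_2 + ... over GF(2^8),
# where addition is XOR and multiplication is carry-less multiplication reduced
# modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11d).
def gf_mul(a: int, b: int) -> int:
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return p

def linear_combine(coeffs, blocks):
    """Apply B = sum_i a_i * B_i byte by byte."""
    out = bytearray(len(blocks[0]))
    for a_i, B_i in zip(coeffs, blocks):
        for j, byte in enumerate(B_i):
            out[j] ^= gf_mul(a_i, byte)
    return bytes(out)

# k = 4: a lost block is a fixed linear combination of any 4 surviving blocks.
repaired = linear_combine(
    [0x01, 0x8E, 0x02, 0x55],                              # hypothetical coefficients a_i
    [b"\x10\x20", b"\x33\x44", b"\x05\x06", b"\x07\x08"],  # hypothetical blocks B_i
)
```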

Erasure Coding
Good: low redundancy with high fault tolerance.
Bad: high repair penalty; in general, k blocks are retrieved to repair a single failed block.
Mitigating the repair penalty of erasure coding is a hot topic:
New erasure codes that reduce repair bandwidth or I/O, e.g., regenerating codes, LRC, Hitchhiker.
Efficient repair approaches for general erasure codes, e.g., lazy repair, PPR.

Conventional Repair
Single-block repair: retrieve k blocks from k working nodes (helpers) and store the repaired block at the requestor.
[Figure: k = 4 helpers N1-N4 each send a block over the network to requestor R, whose downlink becomes the bottleneck.]
Repair time = k timeslots, bottlenecked by the requestor's downlink.
Uneven bandwidth usage (e.g., links among helpers are idle).

Partial-Parallel-Repair (PPR) [Mitra et al., EuroSys '16]
Exploit linearity and addition associativity to perform the repair in a divide-and-conquer manner (k = 4 helpers N1-N4, requestor R):
Timeslot 1: N1 sends a_1 B_1 to N2 (which now holds a_1 B_1 + a_2 B_2); N3 sends a_3 B_3 to N4 (which now holds a_3 B_3 + a_4 B_4).
Timeslot 2: N2 sends a_1 B_1 + a_2 B_2 to N4 (which now holds a_1 B_1 + a_2 B_2 + a_3 B_3 + a_4 B_4).
Timeslot 3: N4 sends the repaired block to R.
Repair time = ceil(log2(k+1)) timeslots.
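The schedule above can be traced with a small sketch, with XOR standing in for the GF(2^8) terms a_i B_i and node names following the slide:

```python
# Trace of PPR's divide-and-conquer combining for k = 4, with XOR standing in
# for the scaled terms a_i * B_i held by the four helpers.
def xor(x: bytes, y: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(x, y))

terms = {f"N{i}": bytes([i]) * 4 for i in range(1, 5)}  # a_i * B_i at each helper

at_N2 = xor(terms["N1"], terms["N2"])   # timeslot 1: N1 -> N2 and, in parallel,
at_N4 = xor(terms["N3"], terms["N4"])   #             N3 -> N4
at_N4 = xor(at_N2, at_N4)               # timeslot 2: N2 -> N4
repaired = at_N4                        # timeslot 3: N4 -> R
# 3 = ceil(log2(k + 1)) timeslots for k = 4, versus k = 4 timeslots for
# conventional repair, which pushes all four blocks through R's downlink.
```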

Open Question
The repair time of erasure coding remains larger than the normal read time; even repair-optimal erasure codes still read more data than the amount of failed data.
Erasure coding is therefore mainly used for warm/cold data: the repair penalty only applies to less frequently accessed data, while hot data remains replicated.
Can we reduce the repair time of erasure coding to almost the same as the normal read time? That would create the opportunity to store hot data with erasure coding.

Our Contributions
Repair pipelining, a technique to speed up repair for general erasure codes:
Applicable to degraded reads and full-node recovery.
O(1) repair time in homogeneous settings.
Extensions to heterogeneous settings.
A prototype, ECPipe, integrated with HDFS and QFS.
Experiments on a local cluster and Amazon EC2: repair time reduced by 90% and 80% over conventional repair and partial-parallel-repair (PPR), respectively.

Repair Pipelining
Goals: eliminate bottlenecked links and effectively utilize the available bandwidth resources during repair.
Key observation: the coding unit (word) is much smaller than the read/write unit (block); e.g., word size ~1 byte vs. block size ~64 MiB.
Words at the same offset across the n blocks of a stripe are encoded together (see the sketch below).
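A small sketch of why this observation matters, using byte-sized words and a plain XOR parity as a stand-in for general erasure coding: parity computed over whole blocks equals the concatenation of parities computed slice by slice, so a block can be encoded and repaired in independent slices.

```python
# Words (here: single bytes) at the same offset are encoded together, so any
# contiguous slice of a block can be encoded/repaired independently.
k_blocks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40"]   # two toy data blocks

# Parity over the whole blocks ...
parity_full = bytes(a ^ b for a, b in zip(*k_blocks))

# ... equals the concatenation of parities computed slice by slice.
slice_size = 2
parity_sliced = b"".join(
    bytes(a ^ b for a, b in zip(k_blocks[0][off:off + slice_size],
                                k_blocks[1][off:off + slice_size]))
    for off in range(0, len(k_blocks[0]), slice_size)
)
assert parity_full == parity_sliced
```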

Repair Pipelining
Idea: slice a block. Each slice comprises multiple words (e.g., slice size ~32 KiB).
Pipeline the repair of each slice through a linear path of helpers.
[Figure: single-block repair with k = 4 helpers and s = 6 slices; each slice is combined hop by hop along N1 -> N2 -> N3 -> N4 -> R, and successive slices overlap in time.]
Repair time = 1 + (k-1)/s timeslots, which approaches 1 timeslot if s is large.
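The repair-time expression can be checked with a tiny timing sketch; one timeslot is the time to send one whole block over one link, and all parameters are illustrative.

```python
# Pipelining timeline: with k helpers on a linear path and a block cut into s
# slices, slice j (0-based) reaches the requestor after (k + j) slice-times,
# so the repair finishes after (k + s - 1)/s timeslots = 1 + (k - 1)/s.
def pipelined_repair_timeslots(k: int, s: int) -> float:
    slice_time = 1.0 / s            # sending one slice takes 1/s of a timeslot
    finish = 0.0
    for j in range(s):              # one new slice enters the pipeline per slice-time
        finish = max(finish, (k + j) * slice_time)
    return finish

print(pipelined_repair_timeslots(k=4, s=6))     # 1.5 timeslots (the slide's example)
print(pipelined_repair_timeslots(k=4, s=2048))  # ~1.0 timeslot for large s
```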

Repair Pipelining
Two types of single-failure repair (the most common case):
Degraded read: repairing an unavailable block at a client.
Full-node recovery: repairing all lost blocks of a failed node at one or multiple nodes, with greedy scheduling of multiple stripes across helpers.
Challenge: repair is degraded by stragglers; any erasure-coding repair faces similar problems because it retrieves data from multiple helpers.
Our approach: address heterogeneity and bypass stragglers.

Extension to Heterogeneity
Heterogeneity: link bandwidths differ.
Case 1: limited bandwidth when a client issues reads to a remote storage system. Cyclic version of repair pipelining: allow the client to issue parallel reads from multiple helpers.
Case 2: arbitrary link bandwidths. Weighted path selection: select the best path of helpers for the repair.

Repair Pipelining (Cyclic Version)
[Figure: cyclic schedule; within each group of slices the helper transmission paths are rotations of N1 -> N2 -> N3 -> N4, so different helpers perform the final send to the requestor R.]
The requestor receives repaired data from k-1 helpers in parallel.
Repair time in homogeneous environments approaches 1 timeslot for large s.
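As a rough illustration of the rotation idea only (a simplification, not the exact schedule from the paper), the sketch below rotates the helper order from one slice group to the next so that the final hop to the requestor comes from different helpers:

```python
# Simplified illustration of the cyclic idea: rotate the helper order across
# slice groups so that the final hop to the requestor R comes from different
# helpers, letting R receive repaired slices from several helpers in parallel.
def cyclic_paths(helpers, num_groups):
    k = len(helpers)
    paths = []
    for g in range(num_groups):
        rotated = helpers[g % k:] + helpers[:g % k]   # e.g. N2 -> N3 -> N4 -> N1
        paths.append(rotated + ["R"])
    return paths

for path in cyclic_paths(["N1", "N2", "N3", "N4"], num_groups=4):
    print(" -> ".join(path))
```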

Weighted Path Selection
Goal: find a path of k + 1 nodes (i.e., k helpers and the requestor) that minimizes the maximum link weight; e.g., set a link's weight to the inverse of its bandwidth, so any straggler is associated with a large weight.
Brute-force search is expensive: (n-1)!/(n-1-k)! permutations.
Our algorithm: apply brute-force search, but avoid searching non-optimal paths. If a link L has a weight larger than the maximum weight of the current optimal path, any path containing L must be non-optimal.
The result remains optimal, with much less search time (see the sketch below).
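A compact sketch of this pruned search, with illustrative link weights, node names, and interface (not ECPipe's actual code):

```python
# Pruned brute-force search: choose and order k helpers out of the surviving
# nodes so that the maximum link weight along helper_1 -> ... -> helper_k ->
# requestor is minimized. Links heavier than the best bottleneck found so far
# are never extended, skipping most of the (n-1)!/(n-1-k)! candidate paths.
def best_repair_path(weight, survivors, requestor, k):
    """weight[(u, v)] is the weight of link u -> v (e.g. 1 / bandwidth)."""
    best = {"path": None, "bottleneck": float("inf")}

    def extend(path, bottleneck):
        if len(path) == k:                      # complete the path with the requestor hop
            total = max(bottleneck, weight[(path[-1], requestor)])
            if total < best["bottleneck"]:
                best["path"], best["bottleneck"] = path + [requestor], total
            return
        for v in survivors:
            if v in path:
                continue
            w = bottleneck if not path else max(bottleneck, weight[(path[-1], v)])
            if w >= best["bottleneck"]:         # prune: cannot beat the current optimum
                continue
            extend(path + [v], w)

    extend([], 0.0)
    return best["path"], best["bottleneck"]

# Hypothetical example: three survivors A, B, C, requestor "R", k = 2.
w = {("A", "B"): 1, ("B", "A"): 1, ("A", "C"): 5, ("C", "A"): 5,
     ("B", "C"): 2, ("C", "B"): 2,
     ("A", "R"): 1, ("B", "R"): 4, ("C", "R"): 1}
print(best_repair_path(w, ["A", "B", "C"], "R", k=2))   # (['B', 'A', 'R'], 1)
```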

Implementation
ECPipe: a middleware atop a distributed storage system.
[Figure: architecture with a coordinator, a requestor, and helper daemons co-located with storage nodes; control flow between the coordinator and the requestor/helpers, data flow among helpers and to the requestor.]
The requestor is implemented as a C++/Java class.
Each helper daemon directly reads local blocks via the native file system.
The coordinator accesses block locations and block-to-stripe mappings.
ECPipe is integrated with HDFS and QFS, with around 110 and 180 LOC of changes, respectively.

Evaluation
ECPipe performance on a 1 Gb/s local cluster.
Trade-off of slice size: too small, and transmission overhead is significant; too large, and there is less parallelization. The best slice size is 32 KiB.
Repair pipelining (basic and cyclic) outperforms conventional repair and PPR by 90.9% and 80.4%, respectively.
[Figure: single-block repair time vs. slice size for (n, k) = (14, 10); repair pipelining is only 7% slower than a direct send over a 1 Gb/s link, i.e., O(1) repair time.]

Evaluation
ECPipe performance on a 1 Gb/s local cluster.
The recovery rate increases with the number of requestors, and repair pipelining (RP and RP+scheduling) achieves a high recovery rate.
[Figure: full-node recovery rate vs. number of requestors for (n, k) = (14, 10).]
Greedy scheduling balances the repair load across helpers when there are more requestors (i.e., more resource contention).

Evaluation
ECPipe performance on Amazon EC2: weighted path selection reduces the single-block repair time of basic repair pipelining by up to 45%.

Evaluation
Single-block repair performance on HDFS and QFS.
[Figures: repair time on QFS vs. slice size, on QFS vs. block size, and on HDFS vs. (n, k).]
ECPipe significantly improves repair performance.
Conventional repair under ECPipe outperforms the original conventional repair inside the distributed file systems (by ~20%), since it avoids fetching blocks via the distributed storage system's routines.
The performance gain is mainly due to repair pipelining (by ~90%).

Conclusions
Repair pipelining is a general technique that enables very fast repair for erasure-coded storage.
Contributions: designs for both degraded reads and full-node recovery; extensions to heterogeneity; the prototype implementation ECPipe; extensive experiments on a local cluster and Amazon EC2.
Source code: http://adslab.cse.cuhk.edu.hk/software/ecpipe