HKBU Institutional Repository


Hong Kong Baptist University
HKBU Institutional Repository

Open Access Theses and Dissertations — Electronic Theses and Dissertations

ESetStore: an erasure-coding based distributed storage system with fast data recovery

Chengjian Liu

Follow this and additional works at:

Recommended Citation
Liu, Chengjian, "ESetStore: an erasure-coding based distributed storage system with fast data recovery" (2018). Open Access Theses and Dissertations.

This Thesis is brought to you for free and open access by the Electronic Theses and Dissertations at HKBU Institutional Repository. It has been accepted for inclusion in Open Access Theses and Dissertations by an authorized administrator of HKBU Institutional Repository. For more information, please contact

HONG KONG BAPTIST UNIVERSITY
Doctor of Philosophy
THESIS ACCEPTANCE

DATE: August 31, 2018
STUDENT'S NAME: LIU Chengjian
THESIS TITLE: ESetStore: An Erasure-Coding Based Distributed Storage System with Fast Data Recovery

This is to certify that the above student's thesis has been examined by the following panel members and has received full approval for acceptance in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Chairman: Dr Peng Heng, Associate Professor, Department of Mathematics, HKBU (Designated by Dean of Faculty of Science)
Internal Members: Prof Ng Joseph K Y, Professor, Department of Computer Science, HKBU (Designated by Head of Department of Computer Science); Prof Xu Jianliang, Professor, Department of Computer Science, HKBU
External Members: Dr He Bingsheng, Associate Professor, School of Computing, National University of Singapore, Singapore; Dr Shao Zili, Associate Professor, Department of Computing, The Hong Kong Polytechnic University
Proxy: Dr Choi Koon Kau, Associate Professor, Department of Computer Science, HKBU
In-attendance: Dr Chu Xiaowen, Associate Professor, Department of Computer Science, HKBU

Issued by Graduate School, HKBU

ESetStore: An Erasure-Coding based Distributed Storage System with Fast Data Recovery

LIU Chengjian

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Principal Supervisor: Dr. Chu Xiaowen

Hong Kong Baptist University

August 2018


Abstract

The past decade has witnessed the rapid growth of data in large-scale distributed storage systems. Triplication, a reliability mechanism with 3x storage overhead adopted by large-scale distributed storage systems, introduces heavy storage costs as the amount of data in storage systems keeps growing. Consequently, erasure codes have been introduced in many storage systems because they can provide higher storage efficiency and fault tolerance than data replication. However, erasure coding has many performance degradation factors in both I/O and computation operations, resulting in great performance degradation in large-scale erasure-coded storage systems. In this thesis, we investigate how to eliminate some key performance issues in I/O and computation operations when applying erasure coding in large-scale storage systems. We also propose a prototype named ESetStore to improve the recovery performance of erasure-coded storage systems. We introduce our studies as follows.

First, we study the encoding and decoding performance of erasure coding, which can be a key bottleneck given the state-of-the-art disk I/O throughput and network bandwidth. We propose a graphics processing unit (GPU)-based implementation of erasure coding named G-CRS, which employs the Cauchy Reed-Solomon (CRS) code, to improve the encoding and decoding performance. To maximize the coding performance of G-CRS by fully utilizing the GPU computational power, we designed and implemented a set of optimization strategies. Our evaluation results demonstrated that G-CRS is 10 times faster than most of the other coding libraries.

Second, we investigate the performance degradation introduced by intensive I/O

operations in recovery for large-scale erasure-coded storage systems. To improve the recovery performance, we propose a data placement algorithm named ESet. We define a configurable parameter named the overlapping factor for system administrators to easily achieve desirable recovery I/O parallelism. Our simulation results show that ESet can significantly improve the data recovery performance without violating the reliability requirement by distributing data and code blocks across different failure domains.

Third, we take a look at the performance of applying coding techniques to in-memory storage. A reliable in-memory cache for key-value stores named R-Memcached is designed and proposed. This work can serve as a prelude to applying erasure coding to in-memory metadata storage. R-Memcached exploits coding techniques to achieve reliability, and can tolerate up to two node failures. Our experimental results show that R-Memcached can maintain very good latency and throughput performance even during periods of node failures.

At last, we design and implement a prototype named ESetStore for erasure-coded storage systems. ESetStore integrates our data placement algorithm ESet to bring fast data recovery to storage systems.

Keywords: Erasure coding, Storage System, ESet, R-Memcached, ESetStore

Acknowledgements

First of all, I would like to express my deepest gratitude to my supervisor Dr. CHU Xiaowen for his kind support and patient guidance during my Ph.D. studies. Dr. CHU offered me the opportunity to continue my research study at HKBU and gave me the starting point of my research career. This work would not have been possible without his supervision. Besides my supervisor, I would like to thank my former supervisor Dr. LUO Qiuming for his kind help during my studies. I would also like to thank Prof. CHEN Guoliang, Prof. MAO Rui and other professors at Shenzhen University for providing valuable research resources during my three years at Shenzhen University. I thank Prof. LEUNG Yiu Wing, Prof. NG Joseph Kee Yin, Dr. LIU Hai, Dr. LU Haiping, Prof. CHEUNG Yiu-Ming and other professors in our department, who gave me valuable advice and comments on my research work.

My sincere thanks go to my labmates Dr. MEI Xinxin, Dr. ZHAO Kaiyong, Mr. WANG Qiang, Mr. SHI Shaohuai and Mr. WANG Canhui. You made a comfortable environment for our studies. In particular, I am grateful to WANG Qiang for his insightful help in completing some work on GPU, to Dr. MEI Xinxin for resolving my LaTeX issues, and to SHI Shaohuai for providing a good living environment. It has really been a good time in your company.

At last, I want to thank my parents and my elder sisters. Their kind support is the key reason that I could continue my research work through all these years.

Table of Contents

Declaration
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures

Chapter 1  Introduction
  Notation and Nomenclature
  Erasure Coding
  Recovery of Erasure-coded Storage Systems
  In-memory Storage
  Thesis Goals and Contributions
  Organization

Chapter 2  Background and Related Work
  Erasure Coding
  GPU Computing
  Recovery of Erasure-Coded Storage
  Reliability of In-memory Storage

Chapter 3  G-CRS: GPU Accelerated Cauchy Reed-Solomon Coding
  Introduction
  Cauchy Reed-Solomon Coding
  Design of G-CRS
  Baseline Implementation
  Optimization Strategies
  Performance Model
  Kernel Analysis
  Dominant Factor Analysis
  Pipelined G-CRS
  Performance Evaluation
  Throughput Under Different Workloads
  Peak Raw Coding Performance
  Optimization Analysis
  Overall Performance
  Summary

Chapter 4  ESet: Placing Data towards Efficient Recovery for Large-scale Erasure-Coded Storage Systems
  Problem Definition
  System Model
  Problem Illustration
  Problem Formulation
  Reliability Analysis
  Revisiting Failures
  Our Solution: ESet
  Reliability Constraint
  Design of ESet
  Grouping for Reliability
  Generation of ESets

  Recovery of a Failed Host
  Performance Evaluation
  Evaluation Overview
  Recovery I/O Parallelism Analysis
  Recovery Performance of Simulating a Year of Failures
  Recovery Performance with Different λ Values
  Recovery Performance of Burst Failures in an Hour
  Summary

Chapter 5  R-Memcached: A Reliable In-Memory Cache for Big Key-Value Stores
  Background
  Introduction to Memcached
  Reliability Challenge
  Design and Implementation of R-Memcached
  System Architecture
  RAIM Implementation
  Set, Get and Delete in R-Memcached
  Asynchronous Update and Degraded Read
  Reliability Analysis
  RAIM Set Reliability
  Reliability of an R-Memcached Cluster
  Performance Evaluation of R-Memcached
  Testbed and Performance Baseline
  Evaluation of RAIM-1
  Evaluation of RAIM-5
  Evaluation of RAIM-6
  Summary

Chapter 6  ESetStore: Introducing Fast Data Recovery to the Erasure-Coded Storage System
  System Architecture of ESetStore
  The Design and Implementation of ESetStore
  ECMeta: the Metadata Service
  Efficient Read and Write Operations
  Fast Recovery with ESet
  Evaluation
  Experimental Setup
  Read and Write Throughput
  Recovery Performance
  Recovery Performance with PPR
  Summary

Chapter 7  Conclusions
  Future Research Directions

Bibliography

Curriculum Vitae

List of Tables

  Features of different types of GPU memory space
  Categorize the operation type by SM and DRAM
  Metrics and parameter settings of different GPUs
  Major Parameters for Measuring Different Workloads
  Main Notations
  Data Center Configuration of Simulation
  Parameters for A Year Simulation
  Parameter for RAIM Reliability Analysis

List of Figures

  The Process of Writing a File to an Erasure-Coded Storage System
  The Architecture of SMs in GTX 980 [17]
  Illustration of CRS encoding. d_i^T is the w-row format of data chunk d_i. c_i^T is the w-row format of parity chunk c_i
  A concrete example of CRS encoding. k = 2, m = 2, w = 2
  Input and Output for Coding
  Abstract Memory Hierarchy of a GPU
  Access pattern to bitmatrix. tid refers to thread ID
  Mapping of each bitmatrix element to memory
  Memory access pattern for a single warp
  Access pattern of w threads on the data blocks
  An example of decoding missing blocks
  (a) Memory throughput with different r; (b) Model accuracy
  Cases for Pipelined I/O and Computation
  Throughput under Different Workloads
  Comparison of raw encoding performance
  Optimization analysis
  Dominating factor analysis
  Overall encoding performance
  Physical Layout of Storage System
  A naive example of the data distribution

  ESets
  Distributed Storage System Architecture with ESet
  Time For Recovering a Failed Host
  Reliability of Storage System
  Reliability From Independent Domain Groups in System
  An example of hosts mapped to ESets
  Simulator Overview
  Recovery I/O Parallelism for n=14 and k=
  Normalized Recovery Performance for A Year Simulation
  Normalized Recovery Performance for Different λ
  Normalized Recovery Performance of Burst Failures in An Hour
  Illustration of Memcached in Web Service
  System Architecture of R-Memcached
  RAIM SETs of RAIM-1, RAIM-5 and RAIM-6
  Node failure detection in R-Memcached
  Memory Layout in R-Memcached
  Get, Set and Delete Operations
  Reliability of a single RAIM Set
  Reliability of RAIM Cluster
  R-Memcached Testbed
  Throughput and Latency for RAIM-1
  Throughput and Latency for RAIM-5
  Throughput and Latency for RAIM-6
  Main Components of ESetStore
  File Management on ECMeta
  Procedure of Create Operation
  Organization of a file and its stripes
  Example of Read and Write Operations

  Example of Read and Write Operations with Overlapped I/O and Coding
  The Procedure of Write Operation in ESetStore
  The Procedure of Read Operation in ESetStore
  Testbed of ESetStore
  Read Throughput in ESetStore
  Write Throughput in ESetStore
  Recovery Performance in ESetStore
  Recovery Performance in ESetStore

Chapter 1

Introduction

The past decade has witnessed the rapid growth of data in large-scale distributed storage systems. Take the ECMWF storage system, described at the Conference on File and Storage Technologies in 2015, as an example [29]. Its storage capacity had already reached 100 PB, and the amount of data in its storage was increasing at a rate of about 45%. A recent work revealed that genomic big data has reached the full capacity of a data center with 100 PB of storage [66]. Protecting such a huge amount of data with a reliability mechanism like triplication [28] introduces heavy storage costs due to its 3x storage overhead. As a consequence, erasure codes have been introduced by many large-scale storage systems as a reliability mechanism with reduced storage overhead and higher reliability. A first example of adopting erasure coding is the Microsoft cloud service Windows Azure Storage [35]; its storage overhead is 10/6x with Local Reconstruction Codes. Facebook's warehouse [73] and Web service storage system f4 [59] also take erasure coding as their reliability mechanism. f4 encodes every ten data blocks into four parity blocks, giving a storage overhead of 1.4x. The Quantcast File System fixes the number of data blocks at 6 and parity blocks at 3, with a 1.5x storage overhead [65]. In addition, distributed file systems such as Hadoop [22] and Ceph [94] have begun to support erasure coding to yield higher reliability and lower storage overhead.

A general erasure coding system works as follows. Initially, the user data to be protected is divided into k equal sized data chunks. The encoding operation

gathers all k data chunks and generates m equal sized parity chunks according to an encoding algorithm. In a distributed storage system, the set of n = k+m data and parity chunks is usually stored on different hardware devices to prevent data loss due to device failures. When no more than m devices fail out of these n devices, the chunks on the failed devices become unavailable. To recover the lost data, a decoding operation gathers k available chunks and reproduces the missing chunks according to a decoding algorithm. The erasure codes that can restore m missing chunks from the remaining k alive chunks have the highest error correction capability and are called Maximum Distance Separable (MDS) codes [7]. Reed-Solomon (RS) coding [77] and its variant Cauchy Reed-Solomon (CRS) coding [8] are the two well-known general MDS codes that can support any values of k and m.

Figure 1.1: The Process of Writing a File to an Erasure-Coded Storage System

Fig. 1.1 presents the process of writing a file to an erasure-coded storage system. The file is first divided into k equal size data blocks. Then m parity blocks are generated from the k data blocks. The n blocks together are called a stripe. The stripe is distributed onto n disks from n hosts belonging to n different racks to tolerate disk-level, host-level and rack-level failures. When no more than m disks, hosts or racks fail, we can use any k available blocks to restore any missing block from the failed components. An erasure-coded storage system carries many stripes to provide reliable storage.
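To make the write path of Fig. 1.1 concrete, the following host-side C sketch splits a buffer into k data blocks, derives m parity blocks, and assigns the i-th block of the stripe to the i-th host (one host per rack). It is only an illustration of the workflow described above, under simplifying assumptions: the XOR-based encode_parity is a stand-in for a real MDS code such as RS or CRS, and all names are hypothetical rather than ESetStore's actual interfaces.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative stripe: K data blocks plus M parity blocks (N = K + M). */
    enum { K = 4, M = 2, N = K + M, BLOCK = 8 };

    /* Placeholder encode: every parity block is the XOR of all K data blocks.
     * A real system would use an MDS code (e.g., RS/CRS) here.               */
    static void encode_parity(unsigned char blk[N][BLOCK]) {
        for (int p = K; p < N; p++) {
            memset(blk[p], 0, BLOCK);
            for (int d = 0; d < K; d++)
                for (int b = 0; b < BLOCK; b++)
                    blk[p][b] ^= blk[d][b];
        }
    }

    int main(void) {
        unsigned char blk[N][BLOCK];
        unsigned char file[K * BLOCK];
        for (int i = 0; i < K * BLOCK; i++) file[i] = (unsigned char)i;

        /* Step 1: split the file into K equal-sized data blocks. */
        for (int d = 0; d < K; d++)
            memcpy(blk[d], file + d * BLOCK, BLOCK);

        /* Step 2: generate the M parity blocks of the stripe. */
        encode_parity(blk);

        /* Step 3: place block i on host i (one host per rack), so a single
         * disk, host, or rack failure costs the stripe at most one block.  */
        for (int i = 0; i < N; i++)
            printf("block %d -> host %d (rack %d)\n", i, i, i);
        return 0;
    }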

Erasure-coded storage systems confront many performance degradation factors. The first is that encoding and decoding operations are computation-intensive tasks, implying a long coding time. Another major issue is that recovery typically takes a long time due to heavy I/O operations: to recover one block, k blocks must be retrieved. This motivates us to present our work, which consists of optimizations to improve the performance of large-scale erasure-coded storage systems.

1.1 Notation and Nomenclature

In this section, we give an explanation of erasure codes, stripes, and the name of the reliability mechanism for in-memory storage.

Erasure Codes. In general, an erasure code is a mechanism to recover missing data. There are many types of erasure codes designed for special scenarios. In our thesis, we only consider the case of the aforementioned MDS codes [7]. The next paragraph gives the details of how they work.

We use three parameters to define an erasure code: n, k, and m, where n equals k+m. Erasure coding, which consists of encoding and decoding, covers the major operations of erasure codes. k equal size data chunks serve as the input for encoding. A whole set of n chunks serves as the output of encoding, where the whole input is included in the output. The remaining m chunks are called parity chunks. Any k distinct chunks can serve as the input of the decoding operation. The decoding operation can regenerate up to n chunks of the whole set.

Stripe. A stripe is the basic unit for managing erasure-coded data. We have mentioned the concept of a stripe in Fig. 1.1. Here we give a formal description of it. A stripe consists of n equal size blocks, where k blocks are data blocks and the remaining n-k blocks are parity blocks. An erasure-coded storage system can guarantee its data reliability as long as each stripe has at most n-k missing blocks. The data reliability

of a distributed system depends on the number of stripes and the data reliability of each stripe, which in turn depends on the values of n and k. Usually, there is a trade-off between the reliability and the storage overhead n/k.

RAIM. RAIM, an abbreviation for Redundant Array of Independent Memory, was recently introduced by IBM [53]. RAIM works in memory in a similar fashion to RAID on disk in order to tolerate a certain level of memory channel failures. In our thesis, we extend the concept of RAIM from a single physical server to a distributed system.

1.2 Erasure Coding

Erasure coding is crucial for the quality of service and user experience, as mentioned above. Modern data centers have begun to deploy high-speed Ethernet or even InfiniBand FDR/QDR/EDR to improve the network speed [6], and disk arrays based on Solid-State Drives (SSDs) to improve the disk input-output (I/O) performance [55]. This technology trend pushes the computationally expensive erasure coding into a potential performance bottleneck in erasure-coded storage systems.

Recently, graphics processing units (GPUs) have been used in some storage systems to perform different computationally expensive tasks. Shredder [5] is one framework for leveraging GPUs to efficiently chunk files for data deduplication and incremental storage. GPUstore [87] is another framework for integrating GPUs into storage systems for file-level or block-level encryption and RAID 6 data recovery. Another GPU-based RAID 6 system has been developed in [41], which uses GPUs to accelerate two RAID 6 coding schemes, namely the Blaum-Roth and Liberation codes. This system achieves a coding speed of up to 3GB/s. However, RAID 6 only supports up to two disk failures and is not suitable for large-scale systems. To date, Gibraltar [20], which employs table lookup operations to implement Galois field multiplications, is the most successful GPU-based Reed-Solomon coding library; notably, it is much faster than the single-threaded Jerasure

[70], which is the most popular erasure coding library for CPUs. PErasure [14] is a recent CRS coding library for GPUs and its performance is even better than Gibraltar's. However, PErasure does not fully utilize the GPU memory system and results in sub-optimal performance. With the rapid improvement of networking speed and aggregated disk I/O throughput, there is a demand to further improve the coding performance.

Data reliability and availability are critical requirements for data storage systems. Although coding-based RAID 5/6 have been the industry standards for decades, replication is still the de facto data protection solution in large-scale distributed storage systems. With the increase in the amounts of data and the deployment of expensive SSDs, there is a great opportunity for erasure coding because it can provide much lower storage overhead and higher reliability compared with data replication. In [47], a comprehensive comparison of erasure coding and replication was presented. However, erasure coding is a compute- and data-intensive task, which brings practical challenges to its adoption in distributed storage systems.

In an erasure-coded distributed storage system, there could be three potential performance bottlenecks: the aggregated disk I/O throughput, the network bandwidth, and the coding performance. Modern data centers have started to deploy high-speed network switches with more than 40Gb/s of bandwidth per network port [6]. Facebook and LinkedIn are already working on 100Gb/s networks for their data centers [37][100]. Meanwhile, the sequential I/O throughput of a single SSD has been improved to more than 4Gb/s [92], and the aggregated I/O throughput of a disk array can easily match the network bandwidth. However, the throughput of software-implemented erasure coding is inversely proportional to the number of parity chunks, and is typically less than 10 Gb/s on multi-core CPUs [14], which makes erasure coding impractical for large-scale distributed systems. Modern GPUs have tens of TFLOPS of computation power and an internal memory bandwidth of a few hundred GB/s, providing an opportunity to speed up erasure coding to saturate the disk I/O and network bandwidth. This motivates us to design and implement

G-CRS to fully utilize the GPU power and achieve a high coding throughput.

1.3 Recovery of Erasure-coded Storage Systems

Many recent studies have revealed that data centers frequently suffer disk-level [80][67] and host-level [63][31] failures [79][61][30][27]. This makes data recovery part of the daily work of storage systems. For example, the Facebook warehouse transfers around 100 TB of data each day to recover data from its failed disks and hosts [73].

The data placement scheme plays a critical role in erasure-coded storage systems. First of all, the data placement scheme should guarantee very high data reliability and availability that can withstand a certain level of disk failures, host failures, and rack failures. Secondly, it should achieve a desirable data recovery throughput. However, recovering a failed host or disk is very time consuming for erasure-coded storage systems [93]. Using the popular Reed-Solomon code as an example, recovering an individual data block requires fetching a total of k blocks. Consider a scenario in which a data center needs to recover a failed disk with 1 TB of data, and k = 10. Then it requires accessing 10 TB of data to recover the failed disk. Assuming the disk I/O throughput is 100 MB/s, the recovery time will be at least 10^5 seconds if the 10 TB of data blocks can only be accessed sequentially. To reduce the recovery time, it is imperative to distribute those 10 TB of data blocks among many different disks at the very beginning, so that we can exploit parallel disk I/O to speed up the data recovery process. Another approach to reducing data recovery time is to design new erasure codes that require less data to recover a failed data block. In this work, we focus on exploiting I/O parallelism.

A simple yet popular data placement scheme is random data placement, which distributes data and code blocks randomly. This makes storage systems able to aggregate I/O from more than k hosts or disks to recover a failed host or disk. However, the recovery performance may not always be guaranteed. For example, Facebook's f4 took two days to recover the data on a host because the data and code blocks for the

failed host were not well distributed [59]. Another side effect is that, during the recovery period (a.k.a. degraded read), service latency can become 10x that of normal mode and service quality decreases greatly. To guarantee recovery performance for each host, a data placement algorithm must ensure that enough hosts can participate in the recovery process. However, this aspect is neglected by existing studies. This motivates us to design a placement algorithm that introduces efficient recovery.

1.4 In-memory Storage

In recent years, key-value stores have been widely used in many commercial large-scale Web-based systems, including Amazon, Facebook, YouTube, Twitter, GitHub, and LinkedIn. In order to reduce the data access latency caused by disk I/O, an in-memory cache system is usually deployed between the front-end Web system and the back-end database system. For example, Facebook is using a very large distributed in-memory cache system built from the popular Memcached [26] [62], which consists of thousands of server nodes [58].

While in-memory storage can greatly improve the performance of a key-value store, there are optimizations to further improve its performance. LSM-trie is an in-memory key-value store for small data [97]. Some studies have also tried flash memory to provide similar performance while reducing the per-bit cost of storage [83] [84][96]. This indicates that performance is a key concern for in-memory storage.

In a large-scale in-memory key-value store system, node failures become very common [88], which may seriously affect the access latency and user experience. How to improve the reliability of the distributed cache system becomes an important issue. Redundancy techniques such as RAID [12] have been widely used in hard disk-based storage systems to offer fault tolerance. RAID-1 is basically the same as data replication, which achieves very good reliability but requires double the cost. RAID-5 and RAID-6 improve the storage efficiency at the expense of decreased access

performance when faced with disk failures. But currently, how these redundancy techniques work in memory is still unknown and worth studying.

1.5 Thesis Goals and Contributions

The goals of this thesis are the following:

1. We aim to improve the performance of the time-consuming encoding and decoding of erasure coding with the help of GPUs.

2. We study placement algorithms to provide reliable storage and improved recovery performance for erasure-coded storage systems.

3. We investigate the performance of applying coding techniques to achieve reliable in-memory storage.

4. We propose a prototype named ESetStore for erasure-coded storage systems. ESetStore brings efficient recovery with the integration of our data placement algorithm ESet.

The overall objective of this thesis is to improve the performance of large-scale erasure-coded storage systems. Towards this objective and the goals mentioned above, this thesis makes the following contributions:

We propose a graphics processing unit (GPU)-based implementation of erasure coding named G-CRS, which employs the Cauchy Reed-Solomon (CRS) code. We designed and implemented a set of optimization strategies, such as a compact structure to store the bitmatrix in GPU constant memory, efficient data access through shared memory, and decoding parallelism, to maximize coding performance by fully utilizing the GPU resources. In addition, we derived a simple yet accurate performance model to demonstrate the maximum coding performance of G-CRS on GPU. We evaluated the performance of G-CRS through extensive experiments on modern GPU architectures such as Maxwell

and Pascal, and compared it with other state-of-the-art coding libraries. The evaluation results revealed that the throughput of G-CRS was 10 times faster than that of most of the other coding libraries.

We present a data placement strategy named ESet which brings recovery efficiency to each host in a distributed storage system. We define a configurable parameter named the overlapping factor for system administrators to easily achieve desirable recovery I/O parallelism. Our simulation results show that ESet can significantly improve the data recovery performance without violating the reliability requirement by distributing data and code blocks across different failure domains.

We present the design, implementation, analysis, and evaluation of R-Memcached, a reliable in-memory key-value cache system that is built on top of the popular Memcached software. R-Memcached exploits coding techniques to achieve reliability, and can tolerate up to two node failures. Our experimental results show that R-Memcached can maintain very good latency and throughput performance even during periods of node failures.

We present the design, implementation and evaluation of ESetStore, a prototype for erasure-coded storage systems. ESetStore achieves good read and write performance. The evaluation demonstrates that ESetStore can bring efficient recovery with our designed data placement algorithm ESet.

1.6 Organization

The remainder of this thesis is organized as follows: Chapter 2 introduces the background and provides an overview of the related literature. Chapter 3 presents the design and implementation of our G-CRS at length. This chapter is based on joint work with Mr. WANG Qiang, Dr. CHU Xiaowen, Prof.

LEUNG Yiu-Wing and has been published in IEEE Transactions on Parallel and Distributed Systems in 2018 [51].

Chapter 4 studies the placement algorithm for erasure-coded storage systems. We demonstrate how our proposed ESet brings recovery efficiency to each host in an erasure-coded large-scale storage system. This chapter is based on joint work with Dr. LIU Hai, Dr. CHU Xiaowen, Prof. LEUNG Yiu-Wing and has been presented at the International Conference on Computer Communication and Networks in 2016 [48].

Chapter 5 investigates the performance of applying coding techniques to in-memory storage. R-Memcached consists of RAIM-1, RAIM-5, and RAIM-6. It can tolerate up to two node failures while maintaining good latency and throughput performance. This chapter is based on joint work with Dr. OUYANG Kai, Dr. LIU Hai, Dr. CHU Xiaowen, Prof. LEUNG Yiu-Wing and has been presented at the International Conference on Big Data Computing and Communications [50] and published in Tsinghua Science and Technology in 2015 [49].

Chapter 6 presents the design, implementation, and evaluation of ESetStore. The evaluation results demonstrate that ESetStore can achieve good read and write performance while providing fast data recovery.

Chapter 7 draws the conclusions and presents some future directions of our research.

Chapter 2

Background and Related Work

In this chapter, we present the background and some related work of our thesis. We first present the related literature on erasure coding. Then we introduce GPU computing. Finally, we discuss the recovery of erasure-coded storage and in-memory reliability.

2.1 Erasure Coding

Erasure coding is performed when writing data to storage systems and when recovering any missing data in storage systems. Thus its performance is crucial to an erasure-coded storage system. Over the past decades, many research works have been proposed to improve the performance of erasure coding. One pioneering study optimizes the Cauchy distribution matrix, which results in better coding performance for CRS coding on the CPU by performing fewer XOR operations [71]. Jerasure [70] is a popular library that implements various kinds of erasure codes on the CPU, including optimized CRS coding. Optimization with efficient scheduling of XOR operations on the CPU is presented in [52]. These sequential CRS algorithms are designed for CPUs and are not suitable for GPUs due to their complicated control flows. Another thread of research aims to exploit parallel computing techniques to speed up erasure coding. For multi-core CPUs, CRS codes have been parallelized in [42, 86]; EVENODD codes have been parallelized in [23];

and RDP codes have been parallelized in [24]. A fast Galois field arithmetic library for multi-core CPUs with support for Intel SIMD instructions has recently been presented in [68]. Although these works have achieved great improvements, their coding performance is still not comparable to the throughput of today's high-speed networks, especially when a large number of parity data chunks is required for higher data reliability. These parallel algorithms cannot be directly applied to GPUs either, due to the different hardware architectures. For many-core GPUs, the Gibraltar library [20] implements classical Reed-Solomon coding and outperforms many existing coding libraries on CPUs. PErasure [14] is a recent CRS coding library for GPUs and its performance is better than Gibraltar's. However, PErasure does not fully utilize the GPU memory system and results in sub-optimal performance.

2.2 GPU Computing

Modern GPUs are typically equipped with hundreds to thousands of processing cores evenly distributed over several streaming multiprocessors (SMs). For example, the Nvidia GTX 980 with the Maxwell architecture contains 16 SMs and 4GB of off-chip GDDR memory named global memory. Fig. 2.1 presents the layout of the streaming multiprocessors of the GTX 980 from [17]. Each SM has 128 Stream Processors (SPs, or cores) and a 96-KB on-chip memory named shared memory, which has much higher throughput and lower latency than the off-chip GDDR memory [54]. Besides the 2MB L2 cache shared by the whole GPU, each SM also has a small amount of on-chip cache to speed up data access to the read-only constant memory. Fig. 3.4 presents an abstraction of a GPU's memory hierarchy.

Currently, CUDA is the most popular programming model for GPUs [19]. A typical CUDA program comprises host functions, which are executed on the central processing unit (CPU), and kernel functions, which are executed on the GPU. Each kernel function runs as a grid of threads, which are organized into many equal sized thread blocks. Each thread block can include a set of threads distributed in a number

of thread warps, each of which has 32 threads that execute the same instruction at a time. Threads in a thread block can share data through their shared memory and perform barrier synchronization.

Figure 2.1: The Architecture of SMs in GTX 980 [17]
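The minimal CUDA sketch below (not taken from G-CRS) only illustrates the concepts just described: a kernel launched as a grid of thread blocks, per-block shared memory, and barrier synchronization with __syncthreads(). The kernel name and the XOR workload are arbitrary choices for illustration.

    #include <cuda_runtime.h>
    #include <stdio.h>

    // Each block stages its slice of the input in shared memory, synchronizes,
    // then each thread XORs its word with its neighbour's word.
    __global__ void xor_neighbours(const long *in, long *out, int n) {
        extern __shared__ long tile[];              // per-block shared memory
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        if (gid >= n) return;
        tile[threadIdx.x] = in[gid];                // load into shared memory
        __syncthreads();                            // barrier within the block
        long next = (threadIdx.x + 1 < blockDim.x && gid + 1 < n)
                        ? tile[threadIdx.x + 1] : 0;
        out[gid] = tile[threadIdx.x] ^ next;        // simple XOR computation
    }

    int main(void) {
        const int n = 1 << 20;
        long *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(long));
        cudaMalloc(&d_out, n * sizeof(long));
        cudaMemset(d_in, 0x5a, n * sizeof(long));

        int threads = 256;                          // threads per block (8 warps)
        int blocks = (n + threads - 1) / threads;   // blocks in the grid
        xor_neighbours<<<blocks, threads, threads * sizeof(long)>>>(d_in, d_out, n);
        cudaDeviceSynchronize();
        printf("launched %d blocks of %d threads\n", blocks, threads);
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }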

Modern GPUs are embedded with hundreds to thousands of arithmetic processing units that provide tremendous computing power, which attracts much work to port computationally intensive applications from CPUs to GPUs. For example, G-BLASTN [102], a nucleotide alignment tool for the GPU, achieves more than 10 times performance improvement over its CPU version NCBI-BLAST. The acceleration of dynamic graph analytics is presented in [81]. In [91], regular expression matching on GPUs is shown to be 48 times faster than on CPUs. A high performance key-value store with GPUs is presented in [101]. SSLShader [36] and AES [44] both demonstrate great performance improvements when data encryption algorithms are offloaded to GPUs. Network coding on GPUs, such as that in [15], [13] and Nuclei [85], is the most closely related work in addition to the aforementioned Gibraltar [20] and PErasure [14]. These studies aim to improve the performance of network coding to match the throughput of high-speed networks for both encoding and decoding.

2.3 Recovery of Erasure-Coded Storage

Recovery is a time-consuming task in an erasure-coded storage system due to intensive I/O operations, as k chunks must be retrieved to recover a failed chunk. Researchers have tried to optimize recovery from two aspects. One is to reduce the I/O operations performed during recovery. The other is to make better utilization of the available I/O resources.

Reducing I/O operations is a commonly used method for improving recovery performance. A replace recovery algorithm is proposed in [103] to reduce I/O operations in XOR-based erasure codes. A solution to reduce the symbols required for recovery is proposed for the RDP code [98]. Hitchhiker [74] is a coding mechanism evolved from the family of regeneration codes [75] that reduces the network traffic and disk I/O by around 25% to 45% during the reconstruction process.

There will be k storage servers involved in recovering a missing chunk or a failed storage server. Partial Parallel Repair is a technique for better utilizing these storage servers

for reconstruction with a pipelined I/O mechanism [57]. ECPipe [45] is a pipelined I/O mechanism for improving recovery performance in a heterogeneous environment by effectively utilizing the available I/O resources.

The essential goal of a storage system is to store data and provide quality of service for data access. When a storage system scales to more than thousands of hosts and tens of thousands of disks, failures are daily events. To prevent data loss caused by failures, data recovery is performed every day. During the data recovery process, accesses to the lost data (called degraded reads) suffer long latency due to data decoding operations. This makes recovery performance play a crucial role in keeping the quality of service of storage systems. A report [78] revealed that after a major failure event, some 60% of affected systems need more than 4 hours to restore their service due to the necessary recovery, and around 20% need more than a whole day to come back to normal service. This indicates that a storage system must take recovery performance into consideration when optimizing its data layout; otherwise the system service quality would be heavily damaged.

Recovery performance is an essential design goal for data placement algorithms [1]. Consider a storage system with m disks: when a disk fails, the recovery performance is optimal if the other m-1 disks can participate in its recovery and each disk contributes equal I/O for recovering the failed disk. For replication-based storage systems, Steiner systems can be applied to achieve optimal recovery performance for a certain number of hosts. The optimal solution for recovery is achieved when some conditions are satisfied [43]. However, it only works with some special numbers of hosts [40][34]. Thus, researchers are trying to design near-optimal Steiner systems. A near-optimal parallelism is proposed in [1] for storage systems with a few disks. In [82], a data placement algorithm is given for replication to obtain optimal parallelism across disks. Copysets [16] addresses the issue of data loss for replication-based storage systems by designing near-optimal Steiner systems. It uses the scatter width to represent how much I/O parallelism a host can obtain for its recovery. It selects hosts from

different racks to form a group to avoid concurrent failures in the same group. Some permutations are made within the group so that each host distributes its replicas across the hosts in the group, and thus each host can obtain near-optimal recovery performance. When a host fails, all other hosts may contribute nearly equal I/O to recover the failed host. However, Copysets is not directly applicable to erasure-coded storage systems, because to restore a data block, only one host is needed for replication, but k hosts are required for erasure coding.

Random data placement algorithms, such as RUSH [32][33], CRUSH [95], and Random Slicing [56], can work with both replication-based and erasure-coded storage systems. These algorithms can guarantee reliability by randomly placing replicas across all failure domains, where each failure domain contains a set of disks or hosts that may become unavailable when a shared component fails. But they mainly focus on how to distribute data evenly in large-scale storage systems. Distributing data to obtain good recovery performance for each host is overlooked. Intuitively, the number of participating hosts is large enough to achieve good recovery performance. However, due to the randomness, the number of hosts participating in recovering a failed host cannot be configured, and the recovery performance of each host may not satisfy the requirement of the storage system. Some storage systems map each host to many virtual storage nodes to accelerate a host's recovery: when a host fails, all other hosts can participate in its recovery. However, this makes storage systems more likely to suffer data loss when more than n-k disks fail concurrently.

In summary, existing data placement algorithms cannot provide efficient recovery for erasure-coded storage systems while guaranteeing system reliability, which would further jeopardize the service quality and reliability of a storage system. This motivates our work to design a placement algorithm for large-scale erasure-coded storage systems that brings efficient recovery performance and ensures system reliability.
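To connect the recovery I/O parallelism discussed here with the arithmetic of Section 1.3, the short C sketch below estimates the time to rebuild a failed disk as a function of the amount of lost data, the code parameter k, the per-disk throughput, and the number of disks that can serve recovery reads in parallel. It is only a back-of-the-envelope model under the stated assumptions (uniform load, disks rather than the network as the bottleneck), not a formula taken from the thesis, and the names are illustrative.

    #include <stdio.h>

    /* Back-of-the-envelope recovery time estimate (illustrative only):
     * lost_bytes : data on the failed disk
     * k          : blocks read per recovered block (Reed-Solomon style)
     * disk_bw    : sustained throughput of one disk, bytes/s
     * readers    : disks that can serve recovery reads in parallel
     * Assumes the k * lost_bytes of reads are spread evenly over the readers. */
    static double recovery_seconds(double lost_bytes, int k,
                                   double disk_bw, int readers) {
        double total_read = (double)k * lost_bytes;
        return total_read / (disk_bw * readers);
    }

    int main(void) {
        double tb = 1e12, mb = 1e6;
        /* Sequential case from Section 1.3: 1 TB lost, k = 10, 100 MB/s, 1 reader. */
        printf("sequential:  %.0f s\n", recovery_seconds(tb, 10, 100 * mb, 1));
        /* With 100 disks serving recovery reads in parallel. */
        printf("100 readers: %.0f s\n", recovery_seconds(tb, 10, 100 * mb, 100));
        return 0;
    }

With a single sequential reader this reproduces the 10^5 seconds mentioned in Section 1.3; spreading the same reads over 100 disks brings the estimate down to about 1,000 seconds, which is exactly the kind of parallelism a placement algorithm must guarantee.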

2.4 Reliability of In-memory Storage

Storing metadata in main memory is key to improving system performance, as in Facebook's storage system Haystack [4]. There are also other storage systems that store entire datasets in memory to provide high performance storage. Storing data in memory greatly reduces access latency and improves throughput compared with disk-based storage. Examples of in-memory storage include Memcached [25], Redis [76], and RAMCloud [64]. Memcached is an in-memory cache system used to reduce data access latency for key-value stores. It has been employed by Facebook to improve its Web service performance. Redis is also an in-memory key-value store with disks as a backup for persistent storage. RAMCloud offers low-latency access to large-scale datasets as it stores all data in DRAM. However, these systems face various kinds of trade-offs when applying reliability mechanisms to offer reliable in-memory storage.

A straightforward and easy-to-implement reliability mechanism is to write data to disk to prevent data loss against any kind of failure. This approach is adopted by systems like Redis [76]: it writes data to disk as a backup when storing data in memory, and when performing data recovery, the data is reloaded from disk into memory. This approach can greatly save storage cost, as disk storage is quite cheap compared with in-memory storage. However, writing the backup to disk ensures reliability at the cost of a huge performance penalty when encountering failures [4]. RAMCloud keeps a copy of each object in memory, which doubles the storage overhead of memory storage [63] and increases the storage cost greatly. As a consequence, coding techniques have been studied to ensure reliability with lower storage cost while reducing the performance penalty caused by failures [11] [72] [99].

In summary, the expensive per-bit cost of memory makes reliable in-memory storage a trade-off: it needs to reduce storage cost while maintaining access performance at a certain level. This motivates us to study the performance degradation of applying coding techniques to in-memory storage.

To this end, we design and implement G-CRS, a GPU-accelerated

Cauchy Reed-Solomon coding library that can match the performance of disk I/O and network bandwidth. We design a data placement algorithm named ESet to improve the recovery performance of erasure-coded storage systems to the desired level. We design and implement R-Memcached to apply coding techniques to in-memory storage; this work can serve as a reference for choosing a proper coding scheme for in-memory storage. Last but not least, we design and implement ESetStore, a prototype erasure-coded storage system with fast data recovery.

Chapter 3

G-CRS: GPU Accelerated Cauchy Reed-Solomon Coding

In this chapter, we present a graphics processing unit (GPU)-based implementation of erasure coding named G-CRS, which employs the Cauchy Reed-Solomon (CRS) code, to improve erasure coding performance. To maximize the coding performance of G-CRS, we designed and implemented a set of optimization strategies, such as a compact structure to store the bitmatrix in GPU constant memory, efficient data access through shared memory, and decoding parallelism, to fully utilize the GPU resources. In addition, we derived a simple yet accurate performance model to demonstrate the maximum coding performance of G-CRS on GPU. We evaluated the performance of G-CRS through extensive experiments on modern GPU architectures such as Maxwell and Pascal, and compared it with other state-of-the-art coding libraries. The evaluation results revealed that the throughput of G-CRS was 10 times faster than that of most of the other coding libraries. Moreover, G-CRS outperformed PErasure (a recently developed, well-optimized CRS coding library for the GPU) by up to 3 times on the same architecture.

3.1 Introduction

In this chapter, we present the design of a new CRS coding library for GPUs, namely G-CRS, that can fully utilize the GPU resources and deliver high coding performance that saturates state-of-the-art network speeds. To this end, we have designed new data structures and a set of optimization strategies. G-CRS can achieve more than 50GB/s of raw coding performance on a modern Nvidia Titan X GPU for the case of m = 16 (i.e., the system can withstand up to 16 device failures).^1 For the optimization of G-CRS, we present a step-by-step optimization analysis to reveal how to utilize GPUs to accelerate CRS coding. We believe our work can be beneficial to other algorithms and applications with similar data access and computational patterns. We propose a simple yet accurate performance model to understand the major factors that affect the coding performance. A pipelined mechanism is investigated to enable our G-CRS to achieve peak performance by efficiently overlapping data copy operations and coding operations.

^1 The source code and experimental results of our G-CRS are available at hkbu.edu.hk/~chxw/gcrs.html.

3.2 Cauchy Reed-Solomon Coding

Figure 3.1: Illustration of CRS encoding. d_i^T is the w-row format of data chunk d_i. c_i^T is the w-row format of parity chunk c_i.

As illustrated in Fig. 3.1, the encoding procedure of CRS takes k equal sized data chunks d_0, d_1, ..., d_{k-1} as input, and generates m parity chunks c_0, c_1, ..., c_{m-1} as output. To perform the coding, it needs to select an integer parameter w that is no less than log_2(k + m). Hence a CRS code can be defined by a triple (k, m, w). CRS first defines an m × k Cauchy distribution matrix over the Galois Field GF(2^w), and then expands it into a (k + m)w × kw generator matrix over GF(2) whose elements are either 1 or 0 [8]. Notice that the top kw rows of the generator matrix form an identity matrix. Each data chunk d_i needs to be transformed into w rows, denoted by D_{i,0}, D_{i,1}, ..., D_{i,w-1}. The w-row format of data chunk d_i is denoted by d_i^T. Then all the k data chunks can be combined into a data matrix with kw rows. By multiplying the generator matrix and the data matrix over GF(2), we can get k + m output chunks that include the k original data chunks (due to the kw × kw identity sub-matrix) and the m parity chunks. Notice that the multiplication in GF(2) can be implemented by efficient bit-wise XOR operations, which is the major property of CRS.

Figure 3.2: A concrete example of CRS encoding. k = 2, m = 2, w = 2.

Fig. 3.2 presents a concrete example of the CRS encoding process where k = 2, m = 2 and w = 2. Data chunk d_i consists of two rows: D_{i,0} and D_{i,1}, i = 0, 1. In the actual encoding process, we only need to calculate the m parity chunks using the bottom mw rows of the generator matrix. For each parity chunk, the values in the corresponding row vector of the generator matrix determine which data chunks will be involved in the XOR operations. For example, the fifth row vector of the generator matrix in Fig. 3.2, <1, 1, 0, 1>, determines that C_{0,0} is generated from D_{0,0}, D_{0,1} and D_{1,1} with XOR operations.
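To make the bitmatrix-driven XOR procedure concrete, the following CPU-side C sketch encodes m parity chunks from k data chunks using the bottom mw rows of the generator matrix, row by row as described above. It is a minimal single-threaded illustration, not G-CRS itself and not the Jerasure API; the packet layout and names are chosen for brevity.

    #include <string.h>

    /* Minimal CPU-side CRS-style encoder (illustrative only).
     * bitmatrix: mw x kw matrix of 0/1 ints (bottom rows of the generator matrix)
     * data:      k chunks, each w packets of psize bytes (packet i*w + r is row r of chunk i)
     * parity:    m chunks, each w packets of psize bytes, produced as output              */
    static void crs_encode_cpu(const int *bitmatrix,
                               const unsigned char *data, unsigned char *parity,
                               int k, int m, int w, size_t psize) {
        for (int row = 0; row < m * w; row++) {             /* one parity packet per row */
            unsigned char *out = parity + (size_t)row * psize;
            memset(out, 0, psize);
            for (int col = 0; col < k * w; col++) {
                if (bitmatrix[row * k * w + col] == 1) {    /* this data packet participates */
                    const unsigned char *in = data + (size_t)col * psize;
                    for (size_t b = 0; b < psize; b++)
                        out[b] ^= in[b];                    /* GF(2) multiply-add is XOR */
                }
            }
        }
    }

For the example in Fig. 3.2, row 0 of the bitmatrix (the fifth row of the full generator matrix) is <1, 1, 0, 1>, so the first parity packet C_{0,0} comes out as the XOR of D_{0,0}, D_{0,1} and D_{1,1}, matching the description above.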

When either a data chunk or a parity chunk becomes unavailable due to a device failure, a decoding operation is triggered to restore the missing chunk. To recover a parity chunk, w row vectors from the generator matrix together with all data chunks serve as the input. By contrast, to recover a data chunk, an inverse matrix is generated from the generator matrix (which can be done offline in advance), and k alive chunks serve as the input together with w row vectors from the inverse matrix. The encoding and decoding operations are essentially the same in terms of data access pattern and computation. Therefore, we use the term coding to represent both encoding and decoding.

3.3 Design of G-CRS

In this section, we first present a high-level view of G-CRS and define our terminology. Next, we provide a baseline implementation of CRS coding on GPUs that is directly migrated from the CPU version. Subsequently, we analyze the potential drawbacks of this basic design and provide a set of optimization strategies to accelerate CRS coding on GPUs. Our G-CRS is implemented by applying all the optimization strategies described in this section.

Fig. 3.3 illustrates the system architecture of G-CRS, which implements a (k, m, w) CRS code. The bitmatrix stores the bottom mw rows of the generator matrix of the CRS code. The input data is divided into k equal sized data chunks, and the output includes m parity chunks of the same size. We use s to represent the number of bytes of a long data type on the target hardware platform, i.e., s = sizeof(long). We define a packet as s consecutive bytes. The XOR of two packets can then be efficiently carried out by a single instruction. We define a data block as w consecutive packets, where w is the parameter of CRS and should be no less than log_2(k + m). The number of data blocks in a chunk is denoted by N. We summarize the high-level workflow of G-CRS in Algorithm 1. When the size of user data is greater than the available GPU memory, we will encode the data in different rounds.
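Algorithm 1 itself is not reproduced in this transcription. The CUDA host-side sketch below only illustrates the multi-round idea just stated: when the user data exceeds the available GPU memory, the host copies one round of data to the device, launches the coding kernel, and copies the results back, round by round. All names (coding_kernel_stub, encode_in_rounds, round_bytes) are illustrative placeholders rather than G-CRS's actual interfaces, and the stub copies as much output as input, whereas the real encoder produces m parity chunks per k data chunks.

    #include <cuda_runtime.h>
    #include <stddef.h>

    // Placeholder kernel: stands in for the real CRS coding kernel.
    __global__ void coding_kernel_stub(const long *in, long *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];   // real code would XOR per the bitmatrix
    }

    // Illustrative multi-round host loop: process 'total' bytes of user data
    // in rounds of at most 'round_bytes', so the working set fits in GPU memory.
    void encode_in_rounds(const long *host_in, long *host_out,
                          size_t total, size_t round_bytes) {
        long *d_in, *d_out;
        cudaMalloc(&d_in, round_bytes);
        cudaMalloc(&d_out, round_bytes);

        for (size_t off = 0; off < total; off += round_bytes) {
            size_t bytes = (total - off < round_bytes) ? total - off : round_bytes;
            int n = (int)(bytes / sizeof(long));

            cudaMemcpy(d_in, (const char *)host_in + off, bytes, cudaMemcpyHostToDevice);
            coding_kernel_stub<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
            cudaMemcpy((char *)host_out + off, d_out, bytes, cudaMemcpyDeviceToHost);
        }
        cudaFree(d_in);
        cudaFree(d_out);
    }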

1  __global__ void crs_coding_kernel_naive
2  (long *in, long *out, int *bitmatrix,
3   int size, int k, int m, int w) {
4      int blockunits = blockDim.x / w;
5      int blockpackets = blockunits * w;
6      int tid = threadIdx.x + blockIdx.x * blockpackets;
7      int unit_id_offset = tid / w * w;
8      int unit_in_id = tid % w, i, j;
9
10     if (threadIdx.x >= blockpackets) return;
11     if (tid >= size) return;
12
13     int index = threadIdx.y * k * w * w + unit_in_id * k * w;
14     long result = 0, input;
15
16     for (i = 0; i < k; ++i) {
17         for (j = 0; j < w; ++j) {
18             if (bitmatrix[index] == 1) {
19                 input = *(in + size * i + unit_id_offset + j);
20                 result = result ^ input;
21             }
22             ++index;
23         }
24     }
25
26     *(out + size * threadIdx.y + tid) = result;
27 }

Listing 1: A baseline implementation of CRS coding on GPU

3.3.1 Baseline Implementation

Listing 1 presents the baseline implementation of the CRS coding kernel written in CUDA, which is directly migrated from a CPU version. The kernel function

defines the behavior of a single GPU thread. In this implementation, each thread is responsible for encoding a single packet. When the kernel function is launched, a total of mwN threads will be created to generate the m parity chunks in parallel. Each element of the bitmatrix is represented by an integer. This kernel function works as follows: (1) The input buffer in and the output buffer out are located in the GPU global memory. (2) The mw bottom row vectors of the generator matrix are stored in the bitmatrix, which is also located in global memory. (3) Each thread calculates the initial index of its assigned packet in each data block (line 13). Then, each thread iterates over its corresponding row in the bitmatrix to determine the data required to perform the XOR operations (lines 16-24). (4) Each thread writes a packet, which is stored in the variable result, to the output buffer (line 26).

Figure 3.3: Input and Output for Coding

Some severe performance penalties exist in this baseline implementation, implying that the GPU resources are substantially under-utilized. To design a fully optimized version of CRS coding on GPU, a thorough understanding of the GPU architecture is required, including its memory subsystem and its method of handling branch divergence. We have identified three major performance penalties in the baseline implementation:

Inefficient memory access: The memory access pattern of the baseline kernel implementation in global memory causes a considerable performance penalty.
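As general background (not the thesis's own analysis of Listing 1), the usual reason a global-memory access pattern hurts GPU performance is a lack of coalescing: when the 32 threads of a warp touch 32 consecutive words, the hardware merges them into a few wide transactions, whereas scattered or strided accesses spread over many memory segments. The two toy CUDA kernels below contrast the patterns; kernel names and sizes are arbitrary choices for illustration, and profiling them typically shows the strided variant reaching only a fraction of the coalesced bandwidth, which motivates optimizing how coding threads map onto packets.

    #include <cuda_runtime.h>

    // Coalesced: consecutive threads of a warp read/write consecutive words.
    __global__ void copy_coalesced(const long *in, long *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: each thread jumps by 'stride' words, scattering the warp's accesses.
    __global__ void copy_strided(const long *in, long *out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }

    int main(void) {
        const int n = 1 << 20, stride = 32;
        long *d_in, *d_out;
        cudaMalloc(&d_in, (size_t)n * stride * sizeof(long));
        cudaMalloc(&d_out, (size_t)n * stride * sizeof(long));
        copy_coalesced<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        copy_strided<<<(n + 255) / 256, 256>>>(d_in, d_out, n * stride, stride);
        cudaDeviceSynchronize();
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }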


More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

Erasure coding and AONT algorithm selection for Secure Distributed Storage. Alem Abreha Sowmya Shetty

Erasure coding and AONT algorithm selection for Secure Distributed Storage. Alem Abreha Sowmya Shetty Erasure coding and AONT algorithm selection for Secure Distributed Storage Alem Abreha Sowmya Shetty Secure Distributed Storage AONT(All-Or-Nothing Transform) unkeyed transformation φ mapping a sequence

More information

Efficiency Considerations of Cauchy Reed-Solomon Implementations on Accelerator and Multi-Core Platforms

Efficiency Considerations of Cauchy Reed-Solomon Implementations on Accelerator and Multi-Core Platforms Efficiency Considerations of Cauchy Reed-Solomon Implementations on Accelerator and Multi-Core Platforms SAAHPC June 15 2010 Knoxville, TN Kathrin Peter Sebastian Borchert Thomas Steinke Zuse Institute

More information

Exploration of Erasure-Coded Storage Systems for High Performance, Reliability, and Inter-operability

Exploration of Erasure-Coded Storage Systems for High Performance, Reliability, and Inter-operability Virginia Commonwealth University VCU Scholars Compass Theses and Dissertations Graduate School 2016 Exploration of Erasure-Coded Storage Systems for High Performance, Reliability, and Inter-operability

More information

SRCMap: Energy Proportional Storage using Dynamic Consolidation

SRCMap: Energy Proportional Storage using Dynamic Consolidation SRCMap: Energy Proportional Storage using Dynamic Consolidation By: Akshat Verma, Ricardo Koller, Luis Useche, Raju Rangaswami Presented by: James Larkby-Lahet Motivation storage consumes 10-25% of datacenter

More information

Modern Erasure Codes for Distributed Storage Systems

Modern Erasure Codes for Distributed Storage Systems Modern Erasure Codes for Distributed Storage Systems Srinivasan Narayanamurthy (Srini) NetApp Everything around us is changing! r The Data Deluge r Disk capacities and densities are increasing faster than

More information

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload. Dror Goldenberg VP Software Architecture Mellanox Technologies

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload. Dror Goldenberg VP Software Architecture Mellanox Technologies Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Dror Goldenberg VP Software Architecture Mellanox Technologies SNIA Legal Notice The material contained in this tutorial is

More information

ActiveScale Erasure Coding and Self Protecting Technologies

ActiveScale Erasure Coding and Self Protecting Technologies NOVEMBER 2017 ActiveScale Erasure Coding and Self Protecting Technologies BitSpread Erasure Coding and BitDynamics Data Integrity and Repair Technologies within The ActiveScale Object Storage System Software

More information

File systems CS 241. May 2, University of Illinois

File systems CS 241. May 2, University of Illinois File systems CS 241 May 2, 2014 University of Illinois 1 Announcements Finals approaching, know your times and conflicts Ours: Friday May 16, 8-11 am Inform us by Wed May 7 if you have to take a conflict

More information

I/O CANNOT BE IGNORED

I/O CANNOT BE IGNORED LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.

More information

Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions. James S. Plank USENIX FAST. University of Tennessee

Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions. James S. Plank USENIX FAST. University of Tennessee Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions James S. Plank University of Tennessee USENIX FAST San Jose, CA February 15, 2013. Authors Jim Plank Tennessee Kevin Greenan EMC/Data

More information

Parallel Approach for Implementing Data Mining Algorithms

Parallel Approach for Implementing Data Mining Algorithms TITLE OF THE THESIS Parallel Approach for Implementing Data Mining Algorithms A RESEARCH PROPOSAL SUBMITTED TO THE SHRI RAMDEOBABA COLLEGE OF ENGINEERING AND MANAGEMENT, FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

Copysets: Reducing the Frequency of Data Loss in Cloud Storage

Copysets: Reducing the Frequency of Data Loss in Cloud Storage Copysets: Reducing the Frequency of Data Loss in Cloud Storage Asaf Cidon, Stephen M. Rumble, Ryan Stutsman, Sachin Katti, John Ousterhout and Mendel Rosenblum Stanford University cidon@stanford.edu, {rumble,stutsman,skatti,ouster,mendel}@cs.stanford.edu

More information

SYSTEM UPGRADE, INC Making Good Computers Better. System Upgrade Teaches RAID

SYSTEM UPGRADE, INC Making Good Computers Better. System Upgrade Teaches RAID System Upgrade Teaches RAID In the growing computer industry we often find it difficult to keep track of the everyday changes in technology. At System Upgrade, Inc it is our goal and mission to provide

More information

Correlation based File Prefetching Approach for Hadoop

Correlation based File Prefetching Approach for Hadoop IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie

More information

HKBU Institutional Repository

HKBU Institutional Repository Hong Kong Baptist University HKBU Institutional Repository Open Access Theses and Dissertations Electronic Theses and Dissertations 8-29-2016 Energy conservation techniques for GPU computing Xinxin Mei

More information

Virtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili

Virtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili Virtual Memory Lecture notes from MKP and S. Yalamanchili Sections 5.4, 5.5, 5.6, 5.8, 5.10 Reading (2) 1 The Memory Hierarchy ALU registers Cache Memory Memory Memory Managed by the compiler Memory Managed

More information

Isilon: Raising The Bar On Performance & Archive Use Cases. John Har Solutions Product Manager Unstructured Data Storage Team

Isilon: Raising The Bar On Performance & Archive Use Cases. John Har Solutions Product Manager Unstructured Data Storage Team Isilon: Raising The Bar On Performance & Archive Use Cases John Har Solutions Product Manager Unstructured Data Storage Team What we ll cover in this session Isilon Overview Streaming workflows High ops/s

More information

Parallelizing Inline Data Reduction Operations for Primary Storage Systems

Parallelizing Inline Data Reduction Operations for Primary Storage Systems Parallelizing Inline Data Reduction Operations for Primary Storage Systems Jeonghyeon Ma ( ) and Chanik Park Department of Computer Science and Engineering, POSTECH, Pohang, South Korea {doitnow0415,cipark}@postech.ac.kr

More information

The Google File System (GFS)

The Google File System (GFS) 1 The Google File System (GFS) CS60002: Distributed Systems Antonio Bruto da Costa Ph.D. Student, Formal Methods Lab, Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur 2 Design constraints

More information

Efficient Lists Intersection by CPU- GPU Cooperative Computing

Efficient Lists Intersection by CPU- GPU Cooperative Computing Efficient Lists Intersection by CPU- GPU Cooperative Computing Di Wu, Fan Zhang, Naiyong Ao, Gang Wang, Xiaoguang Liu, Jing Liu Nankai-Baidu Joint Lab, Nankai University Outline Introduction Cooperative

More information

Decentralized Distributed Storage System for Big Data

Decentralized Distributed Storage System for Big Data Decentralized Distributed Storage System for Big Presenter: Wei Xie -Intensive Scalable Computing Laboratory(DISCL) Computer Science Department Texas Tech University Outline Trends in Big and Cloud Storage

More information

RAMCube: Exploiting Network Proximity for RAM-Based Key-Value Store

RAMCube: Exploiting Network Proximity for RAM-Based Key-Value Store RAMCube: Exploiting Network Proximity for RAM-Based Key-Value Store Yiming Zhang, Rui Chu @ NUDT Chuanxiong Guo, Guohan Lu, Yongqiang Xiong, Haitao Wu @ MSRA June, 2012 1 Background Disk-based storage

More information

Reducing The De-linearization of Data Placement to Improve Deduplication Performance

Reducing The De-linearization of Data Placement to Improve Deduplication Performance Reducing The De-linearization of Data Placement to Improve Deduplication Performance Yujuan Tan 1, Zhichao Yan 2, Dan Feng 2, E. H.-M. Sha 1,3 1 School of Computer Science & Technology, Chongqing University

More information

Performance of relational database management

Performance of relational database management Building a 3-D DRAM Architecture for Optimum Cost/Performance By Gene Bowles and Duke Lambert As systems increase in performance and power, magnetic disk storage speeds have lagged behind. But using solidstate

More information

Distributed File Systems II

Distributed File Systems II Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation

More information

Ambry: LinkedIn s Scalable Geo- Distributed Object Store

Ambry: LinkedIn s Scalable Geo- Distributed Object Store Ambry: LinkedIn s Scalable Geo- Distributed Object Store Shadi A. Noghabi *, Sriram Subramanian +, Priyesh Narayanan +, Sivabalan Narayanan +, Gopalakrishna Holla +, Mammad Zadeh +, Tianwei Li +, Indranil

More information

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,

More information

B. Tech. Project Second Stage Report on

B. Tech. Project Second Stage Report on B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic

More information

VMware vsphere Clusters in Security Zones

VMware vsphere Clusters in Security Zones SOLUTION OVERVIEW VMware vsan VMware vsphere Clusters in Security Zones A security zone, also referred to as a DMZ," is a sub-network that is designed to provide tightly controlled connectivity to an organization

More information

Dense matching GPU implementation

Dense matching GPU implementation Dense matching GPU implementation Author: Hailong Fu. Supervisor: Prof. Dr.-Ing. Norbert Haala, Dipl. -Ing. Mathias Rothermel. Universität Stuttgart 1. Introduction Correspondence problem is an important

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 Advance Encryption Standard (AES) Rijndael algorithm is symmetric block cipher that can process data blocks of 128 bits, using cipher keys with lengths of 128, 192, and 256

More information

IBM Spectrum NAS, IBM Spectrum Scale and IBM Cloud Object Storage

IBM Spectrum NAS, IBM Spectrum Scale and IBM Cloud Object Storage IBM Spectrum NAS, IBM Spectrum Scale and IBM Cloud Object Storage Silverton Consulting, Inc. StorInt Briefing 2017 SILVERTON CONSULTING, INC. ALL RIGHTS RESERVED Page 2 Introduction Unstructured data has

More information

Design Tradeoffs for Data Deduplication Performance in Backup Workloads

Design Tradeoffs for Data Deduplication Performance in Backup Workloads Design Tradeoffs for Data Deduplication Performance in Backup Workloads Min Fu,DanFeng,YuHua,XubinHe, Zuoning Chen *, Wen Xia,YuchengZhang,YujuanTan Huazhong University of Science and Technology Virginia

More information

Data Center Performance

Data Center Performance Data Center Performance George Porter CSE 124 Feb 15, 2017 *Includes material taken from Barroso et al., 2013, UCSD 222a, and Cedric Lam and Hong Liu (Google) Part 1: Partitioning work across many servers

More information

Database Architecture 2 & Storage. Instructor: Matei Zaharia cs245.stanford.edu

Database Architecture 2 & Storage. Instructor: Matei Zaharia cs245.stanford.edu Database Architecture 2 & Storage Instructor: Matei Zaharia cs245.stanford.edu Summary from Last Time System R mostly matched the architecture of a modern RDBMS» SQL» Many storage & access methods» Cost-based

More information

SolidFire and Pure Storage Architectural Comparison

SolidFire and Pure Storage Architectural Comparison The All-Flash Array Built for the Next Generation Data Center SolidFire and Pure Storage Architectural Comparison June 2014 This document includes general information about Pure Storage architecture as

More information

GFS: The Google File System. Dr. Yingwu Zhu

GFS: The Google File System. Dr. Yingwu Zhu GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can

More information

vsan Security Zone Deployment First Published On: Last Updated On:

vsan Security Zone Deployment First Published On: Last Updated On: First Published On: 06-14-2017 Last Updated On: 11-20-2017 1 1. vsan Security Zone Deployment 1.1.Solution Overview Table of Contents 2 1. vsan Security Zone Deployment 3 1.1 Solution Overview VMware vsphere

More information

D E N A L I S T O R A G E I N T E R F A C E. Laura Caulfield Senior Software Engineer. Arie van der Hoeven Principal Program Manager

D E N A L I S T O R A G E I N T E R F A C E. Laura Caulfield Senior Software Engineer. Arie van der Hoeven Principal Program Manager 1 T HE D E N A L I N E X T - G E N E R A T I O N H I G H - D E N S I T Y S T O R A G E I N T E R F A C E Laura Caulfield Senior Software Engineer Arie van der Hoeven Principal Program Manager Outline Technology

More information

CONFIGURATION GUIDE WHITE PAPER JULY ActiveScale. Family Configuration Guide

CONFIGURATION GUIDE WHITE PAPER JULY ActiveScale. Family Configuration Guide WHITE PAPER JULY 2018 ActiveScale Family Configuration Guide Introduction The world is awash in a sea of data. Unstructured data from our mobile devices, emails, social media, clickstreams, log files,

More information

DiskReduce: Making Room for More Data on DISCs. Wittawat Tantisiriroj

DiskReduce: Making Room for More Data on DISCs. Wittawat Tantisiriroj DiskReduce: Making Room for More Data on DISCs Wittawat Tantisiriroj Lin Xiao, Bin Fan, and Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University GFS/HDFS Triplication GFS & HDFS triplicate

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

Survey: Users Share Their Storage Performance Needs. Jim Handy, Objective Analysis Thomas Coughlin, PhD, Coughlin Associates

Survey: Users Share Their Storage Performance Needs. Jim Handy, Objective Analysis Thomas Coughlin, PhD, Coughlin Associates Survey: Users Share Their Storage Performance Needs Jim Handy, Objective Analysis Thomas Coughlin, PhD, Coughlin Associates Table of Contents The Problem... 1 Application Classes... 1 IOPS Needs... 2 Capacity

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

Facilitating Magnetic Recording Technology Scaling for Data Center Hard Disk Drives through Filesystem-level Transparent Local Erasure Coding

Facilitating Magnetic Recording Technology Scaling for Data Center Hard Disk Drives through Filesystem-level Transparent Local Erasure Coding Facilitating Magnetic Recording Technology Scaling for Data Center Hard Disk Drives through Filesystem-level Transparent Local Erasure Coding Yin Li, Hao Wang, Xuebin Zhang, Ning Zheng, Shafa Dahandeh,

More information

When MPPDB Meets GPU:

When MPPDB Meets GPU: When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models RCFile: A Fast and Space-efficient Data

More information

Nutanix Tech Note. Virtualizing Microsoft Applications on Web-Scale Infrastructure

Nutanix Tech Note. Virtualizing Microsoft Applications on Web-Scale Infrastructure Nutanix Tech Note Virtualizing Microsoft Applications on Web-Scale Infrastructure The increase in virtualization of critical applications has brought significant attention to compute and storage infrastructure.

More information

called Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil

called Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil Parallel Genome-Wide Analysis With Central And Graphic Processing Units Muhamad Fitra Kacamarga mkacamarga@binus.edu James W. Baurley baurley@binus.edu Bens Pardamean bpardamean@binus.edu Abstract The

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

HTRC Data API Performance Study

HTRC Data API Performance Study HTRC Data API Performance Study Yiming Sun, Beth Plale, Jiaan Zeng Amazon Indiana University Bloomington {plale, jiaazeng}@cs.indiana.edu Abstract HathiTrust Research Center (HTRC) allows users to access

More information

The Performance Analysis of a Service Deployment System Based on the Centralized Storage

The Performance Analysis of a Service Deployment System Based on the Centralized Storage The Performance Analysis of a Service Deployment System Based on the Centralized Storage Zhu Xu Dong School of Computer Science and Information Engineering Zhejiang Gongshang University 310018 Hangzhou,

More information

New Approach to Unstructured Data

New Approach to Unstructured Data Innovations in All-Flash Storage Deliver a New Approach to Unstructured Data Table of Contents Developing a new approach to unstructured data...2 Designing a new storage architecture...2 Understanding

More information

The Future of High Performance Computing

The Future of High Performance Computing The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer

More information

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,

More information

CSE 124: Networked Services Lecture-17

CSE 124: Networked Services Lecture-17 Fall 2010 CSE 124: Networked Services Lecture-17 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa10/cse124 11/30/2010 CSE 124 Networked Services Fall 2010 1 Updates PlanetLab experiments

More information

Google File System. By Dinesh Amatya

Google File System. By Dinesh Amatya Google File System By Dinesh Amatya Google File System (GFS) Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung designed and implemented to meet rapidly growing demand of Google's data processing need a scalable

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Mining Distributed Frequent Itemset with Hadoop

Mining Distributed Frequent Itemset with Hadoop Mining Distributed Frequent Itemset with Hadoop Ms. Poonam Modgi, PG student, Parul Institute of Technology, GTU. Prof. Dinesh Vaghela, Parul Institute of Technology, GTU. Abstract: In the current scenario

More information

EMC XTREMCACHE ACCELERATES VIRTUALIZED ORACLE

EMC XTREMCACHE ACCELERATES VIRTUALIZED ORACLE White Paper EMC XTREMCACHE ACCELERATES VIRTUALIZED ORACLE EMC XtremSF, EMC XtremCache, EMC Symmetrix VMAX and Symmetrix VMAX 10K, XtremSF and XtremCache dramatically improve Oracle performance Symmetrix

More information

Strategic Briefing Paper Big Data

Strategic Briefing Paper Big Data Strategic Briefing Paper Big Data The promise of Big Data is improved competitiveness, reduced cost and minimized risk by taking better decisions. This requires affordable solution architectures which

More information

LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data

LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data Xingbo Wu Yuehai Xu Song Jiang Zili Shao The Hong Kong Polytechnic University The Challenge on Today s Key-Value Store Trends on workloads

More information

GFS: The Google File System

GFS: The Google File System GFS: The Google File System Brad Karp UCL Computer Science CS GZ03 / M030 24 th October 2014 Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one

More information

Advanced Database Systems

Advanced Database Systems Lecture II Storage Layer Kyumars Sheykh Esmaili Course s Syllabus Core Topics Storage Layer Query Processing and Optimization Transaction Management and Recovery Advanced Topics Cloud Computing and Web

More information

Isilon Performance. Name

Isilon Performance. Name 1 Isilon Performance Name 2 Agenda Architecture Overview Next Generation Hardware Performance Caching Performance Streaming Reads Performance Tuning OneFS Architecture Overview Copyright 2014 EMC Corporation.

More information

IBM Spectrum Scale IO performance

IBM Spectrum Scale IO performance IBM Spectrum Scale 5.0.0 IO performance Silverton Consulting, Inc. StorInt Briefing 2 Introduction High-performance computing (HPC) and scientific computing are in a constant state of transition. Artificial

More information

Exploring GPU Architecture for N2P Image Processing Algorithms

Exploring GPU Architecture for N2P Image Processing Algorithms Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly

More information

Open vstorage RedHat Ceph Architectural Comparison

Open vstorage RedHat Ceph Architectural Comparison Open vstorage RedHat Ceph Architectural Comparison Open vstorage is the World s fastest Distributed Block Store that spans across different Datacenter. It combines ultrahigh performance and low latency

More information

RAID6L: A Log-Assisted RAID6 Storage Architecture with Improved Write Performance

RAID6L: A Log-Assisted RAID6 Storage Architecture with Improved Write Performance RAID6L: A Log-Assisted RAID6 Storage Architecture with Improved Write Performance Chao Jin, Dan Feng, Hong Jiang, Lei Tian School of Computer, Huazhong University of Science and Technology Wuhan National

More information

GPU Programming Using NVIDIA CUDA

GPU Programming Using NVIDIA CUDA GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics

More information

CSE 124: Networked Services Lecture-16

CSE 124: Networked Services Lecture-16 Fall 2010 CSE 124: Networked Services Lecture-16 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa10/cse124 11/23/2010 CSE 124 Networked Services Fall 2010 1 Updates PlanetLab experiments

More information

Enhanced Web Log Based Recommendation by Personalized Retrieval

Enhanced Web Log Based Recommendation by Personalized Retrieval Enhanced Web Log Based Recommendation by Personalized Retrieval Xueping Peng FACULTY OF ENGINEERING AND INFORMATION TECHNOLOGY UNIVERSITY OF TECHNOLOGY, SYDNEY A thesis submitted for the degree of Doctor

More information

The Fusion Distributed File System

The Fusion Distributed File System Slide 1 / 44 The Fusion Distributed File System Dongfang Zhao February 2015 Slide 2 / 44 Outline Introduction FusionFS System Architecture Metadata Management Data Movement Implementation Details Unique

More information