HKBU Institutional Repository


Hong Kong Baptist University
HKBU Institutional Repository

Open Access Theses and Dissertations — Electronic Theses and Dissertations

ESetStore: an erasure-coding based distributed storage system with fast data recovery

Chengjian Liu

Follow this and additional works at:

Recommended Citation
Liu, Chengjian, "ESetStore: an erasure-coding based distributed storage system with fast data recovery" (2018). Open Access Theses and Dissertations.

This Thesis is brought to you for free and open access by the Electronic Theses and Dissertations at HKBU Institutional Repository. It has been accepted for inclusion in Open Access Theses and Dissertations by an authorized administrator of HKBU Institutional Repository. For more information, please contact

HONG KONG BAPTIST UNIVERSITY
Doctor of Philosophy
THESIS ACCEPTANCE

DATE: August 31, 2018
STUDENT'S NAME: LIU Chengjian
THESIS TITLE: ESetStore: An Erasure-Coding Based Distributed Storage System with Fast Data Recovery

This is to certify that the above student's thesis has been examined by the following panel members and has received full approval for acceptance in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Chairman: Dr Peng Heng, Associate Professor, Department of Mathematics, HKBU (Designated by Dean of Faculty of Science)
Internal Members: Prof Ng Joseph K Y, Professor, Department of Computer Science, HKBU (Designated by Head of Department of Computer Science); Prof Xu Jianliang, Professor, Department of Computer Science, HKBU
External Members: Dr He Bingsheng, Associate Professor, School of Computing, National University of Singapore, Singapore; Dr Shao Zili, Associate Professor, Department of Computing, The Hong Kong Polytechnic University
Proxy: Dr Choi Koon Kau, Associate Professor, Department of Computer Science, HKBU
In-attendance: Dr Chu Xiaowen, Associate Professor, Department of Computer Science, HKBU

Issued by Graduate School, HKBU

ESetStore: An Erasure-Coding based Distributed Storage System with Fast Data Recovery

LIU Chengjian

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Principal Supervisor: Dr. Chu Xiaowen

Hong Kong Baptist University

August 2018


Abstract

The past decade has witnessed the rapid growth of data in large-scale distributed storage systems. Triplication, a reliability mechanism with 3x storage overhead adopted by large-scale distributed storage systems, introduces heavy storage costs as the amount of data in storage systems keeps growing. Consequently, erasure codes have been introduced in many storage systems because they can provide higher storage efficiency and fault tolerance than data replication. However, erasure coding has many performance degradation factors in both I/O and computation operations, resulting in great performance degradation in large-scale erasure-coded storage systems. In this thesis, we investigate how to eliminate some key performance issues in I/O and computation operations when applying erasure coding in large-scale storage systems. We also propose a prototype named ESetStore to improve the recovery performance of erasure-coded storage systems. We introduce our studies as follows.

First, we study the encoding and decoding performance of erasure coding, which can be a key bottleneck given the state-of-the-art disk I/O throughput and network bandwidth. We propose a graphics processing unit (GPU)-based implementation of erasure coding named G-CRS, which employs the Cauchy Reed-Solomon (CRS) code, to improve the encoding and decoding performance. To maximize the coding performance of G-CRS by fully utilizing the GPU computational power, we designed and implemented a set of optimization strategies. Our evaluation results demonstrated that G-CRS is 10 times faster than most of the other coding libraries.

Second, we investigate the performance degradation introduced by intensive I/O

operations in recovery for large-scale erasure-coded storage systems. To improve the recovery performance, we propose a data placement algorithm named ESet. We define a configurable parameter named the overlapping factor for system administrators to easily achieve desirable recovery I/O parallelism. Our simulation results show that ESet can significantly improve the data recovery performance without violating the reliability requirement by distributing data and code blocks across different failure domains.

Third, we take a look at the performance of applying coding techniques to in-memory storage. A reliable in-memory cache for key-value stores named R-Memcached is designed and proposed. This work can serve as a prelude to applying erasure coding to in-memory metadata storage. R-Memcached exploits coding techniques to achieve reliability, and can tolerate up to two node failures. Our experimental results show that R-Memcached can maintain very good latency and throughput performance even during periods of node failures.

At last, we design and implement a prototype named ESetStore for erasure-coded storage systems. ESetStore integrates our data placement algorithm ESet to bring fast data recovery to storage systems.

Keywords: Erasure coding, Storage System, ESet, R-Memcached, ESetStore

Acknowledgements

First of all, I would like to express my deepest gratitude to my supervisor Dr. CHU Xiaowen for his kind support and patient guidance during my Ph.D. studies. Dr. CHU offered me the opportunity to continue my research study at HKBU and gave me the starting point of my research career. This work would not have been possible without his supervision. Besides my supervisor, I would like to thank my former supervisor Dr. LUO Qiuming for his kind help during my studies. I would also like to thank Prof. CHEN Guoliang, Prof. MAO Rui and other professors at Shenzhen University for providing valuable research resources during my three years at Shenzhen University. I thank Prof. LEUNG Yiu Wing, Prof. NG Joseph Kee Yin, Dr. LIU Hai, Dr. LU Haiping, Prof. CHEUNG Yiu-Ming and other professors in our department, who gave me valuable advice and comments on my research work.

My sincere thanks go to my labmates Dr. MEI Xinxin, Dr. ZHAO Kaiyong, Mr. WANG Qiang, Mr. SHI Shaohuai and Mr. WANG Canhui. You made a comfortable environment for our studies. In particular, I am grateful to WANG Qiang for his insightful help in completing some work on GPU, to Dr. MEI Xinxin for resolving my LaTeX issues, and to SHI Shaohuai for providing a good living environment. It has really been a good time in your company.

At last, I want to thank my parents and my elder sisters. Their kind support is the key reason that I could continue my research work through all these years.

Table of Contents

Declaration
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures

Chapter 1  Introduction
  Notation and Nomenclature
  Erasure Coding
  Recovery of Erasure-coded Storage Systems
  In-memory Storage
  Thesis Goals and Contributions
  Organization

Chapter 2  Background and Related Work
  Erasure Coding
  GPU Computing
  Recovery of Erasure-Coded Storage
  Reliability of In-memory Storage

Chapter 3  G-CRS: GPU Accelerated Cauchy Reed-Solomon Coding
  Introduction
  Cauchy Reed-Solomon Coding
  Design of G-CRS
  Baseline Implementation
  Optimization Strategies
  Performance Model
  Kernel Analysis
  Dominant Factor Analysis
  Pipelined G-CRS
  Performance Evaluation
  Throughput Under Different Workloads
  Peak Raw Coding Performance
  Optimization Analysis
  Overall Performance
  Summary

Chapter 4  ESet: Placing Data towards Efficient Recovery for Large-scale Erasure-Coded Storage Systems
  Problem Definition
  System Model
  Problem Illustration
  Problem Formulation
  Reliability Analysis
  Revisiting Failures
  Our Solution: ESet
  Reliability Constraint
  Design of ESet
  Grouping for Reliability
  Generation of ESets

  Recovery of a Failed Host
  Performance Evaluation
  Evaluation Overview
  Recovery I/O Parallelism Analysis
  Recovery Performance of Simulating a Year of Failures
  Recovery Performance with Different λ Values
  Recovery Performance of Burst Failures in an Hour
  Summary

Chapter 5  R-Memcached: A Reliable In-Memory Cache for Big Key-Value Stores
  Background
  Introduction to Memcached
  Reliability Challenge
  Design and Implementation of R-Memcached
  System Architecture
  RAIM Implementation
  Set, Get and Delete in R-Memcached
  Asynchronous Update and Degraded Read
  Reliability Analysis
  RAIM Set Reliability
  Reliability of an R-Memcached Cluster
  Performance Evaluation of R-Memcached
  Testbed and Performance Baseline
  Evaluation of RAIM-1
  Evaluation of RAIM-5
  Evaluation of RAIM-6
  Summary

Chapter 6  ESetStore: Introducing Fast Data Recovery to the Erasure-Coded Storage System
  System Architecture of ESetStore
  The Design and Implementation of ESetStore
  ECMeta: the Metadata Service
  Efficient Read and Write Operations
  Fast Recovery with ESet
  Evaluation
  Experimental Setup
  Read and Write Throughput
  Recovery Performance
  Recovery Performance with PPR
  Summary

Chapter 7  Conclusions
  Future Research Directions

Bibliography

Curriculum Vitae

List of Tables

  Features of different types of GPU memory space
  Categorize the operation type by SM and DRAM
  Metrics and parameter settings of different GPUs
  Major Parameters for Measuring Different Workloads
  Main Notations
  Data Center Configuration of Simulation
  Parameters for A Year Simulation
  Parameter for RAIM Reliability Analysis

List of Figures

  The Process of Writing a File to an Erasure-Coded Storage System
  The Architecture of SMs in GTX 980 [17]
  Illustration of CRS encoding. d_i^T is the w-row format of data chunk d_i. c_i^T is the w-row format of parity chunk c_i
  A concrete example of CRS encoding. k = 2, m = 2, w = 2
  Input and Output for Coding
  Abstract Memory Hierarchy of a GPU
  Access pattern to bitmatrix. tid refers to thread ID
  Mapping of each bitmatrix element to memory
  Memory access pattern for a single warp
  Access pattern of w threads on the data blocks
  An example of decoding missing blocks
  (a) Memory throughput with different r; (b) Model accuracy
  Cases for Pipelined I/O and Computation
  Throughput under Different Workloads
  Comparison of raw encoding performance
  Optimization analysis
  Dominating factor analysis
  Overall encoding performance
  Physical Layout of Storage System
  A naive example of the data distribution

  ESets
  Distributed Storage System Architecture with ESet
  Time For Recovering a Failed Host
  Reliability of Storage System
  Reliability From Independent Domain Groups in System
  An example of hosts mapped to ESets
  Simulator Overview
  Recovery I/O Parallelism for n=14 and k=
  Normalized Recovery Performance for A Year Simulation
  Normalized Recovery Performance for Different λ
  Normalized Recovery Performance of Burst Failures in An Hour
  Illustration of Memcached in Web Service
  System Architecture of R-Memcached
  RAIM SETs of RAIM-1, RAIM-5 and RAIM-6
  Node failure detection in R-Memcached
  Memory Layout in R-Memcached
  Get, Set and Delete Operations
  Reliability of a single RAIM Set
  Reliability of RAIM Cluster
  R-Memcached Testbed
  Throughput and Latency for RAIM-1
  Throughput and Latency for RAIM-5
  Throughput and Latency for RAIM-6
  Main Components of ESetStore
  File Management on ECMeta
  Procedure of Create Operation
  Organization of a file and its stripes
  Example of Read and Write Operations

  Example of Read and Write Operations with Overlapped I/O and Coding
  The Procedure of Write Operation in ESetStore
  The Procedure of Read Operation in ESetStore
  Testbed of ESetStore
  Read Throughput in ESetStore
  Write Throughput in ESetStore
  Recovery Performance in ESetStore
  Recovery Performance in ESetStore

Chapter 1

Introduction

The past decade has witnessed the rapid growth of data in large-scale distributed storage systems. Take the ECMWF storage system, described at the Conference on File and Storage Technologies in 2015, as an example [29]. Its storage capacity had already reached 100 PB, and the amount of data in its storage was increasing at a rate of about 45%. A recent work revealed that genomic big data has reached the full capacity of a data center with 100 PB of storage [66]. Protecting such a huge amount of data with a reliability mechanism like triplication [28] introduces heavy storage costs due to its 3x storage overhead. As a consequence, erasure codes have been introduced by many large-scale storage systems as a reliability mechanism with reduced storage overhead and higher reliability. A first example of adopting erasure coding is the Microsoft cloud service Windows Azure Storage [35]; its storage overhead is 10/6x with Local Reconstruction Codes. Facebook's warehouse [73] and Web service storage system f4 [59] also take erasure coding as their reliability mechanism. f4 encodes every ten data blocks into four parity blocks, giving a storage overhead of 1.4x. The Quantcast File System fixes the number of data blocks at 6 and parity blocks at 3, with a 1.5x storage overhead [65]. In addition, distributed file systems such as Hadoop [22] and Ceph [94] have begun to support erasure coding to yield higher reliability and lower storage overhead.

A general erasure coding system works as follows. Initially, the user data to be protected is divided into k equal sized data chunks. The encoding operation

gathers all k data chunks and generates m equal sized parity chunks according to an encoding algorithm. In a distributed storage system, the set of n = k+m data and parity chunks is usually stored on different hardware devices to prevent data loss due to device failures. When no more than m devices fail out of these n devices, the chunks on the failed devices become unavailable. To recover the lost data, a decoding operation gathers k available chunks and reproduces the missing chunks according to a decoding algorithm. The erasure codes that can restore m missing chunks from the remaining k alive chunks have the highest error correction capability and are called Maximum Distance Separable (MDS) codes [7]. Reed-Solomon (RS) coding [77] and its variant Cauchy Reed-Solomon (CRS) coding [8] are the two well-known general MDS codes that can support any values of k and m.

Figure 1.1: The Process of Writing a File to an Erasure-Coded Storage System

Fig. 1.1 presents the process of writing a file to an erasure-coded storage system. The file is first divided into k equal size data blocks. Then m parity blocks are generated from the k data blocks. The n blocks together are called a stripe. The stripe is distributed onto n disks from n hosts belonging to n different racks to tolerate disk-level, host-level and rack-level failures. When no more than m disks, hosts or racks fail, we can use any k available blocks to restore any missing block from the failed components. An erasure-coded storage system carries many stripes to provide reliable storage.
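To make the write path of Fig. 1.1 concrete, the following host-side C sketch splits a buffer into k data blocks, derives m parity blocks, and assigns the i-th block of the stripe to the i-th host (one host per rack). It is only an illustration of the workflow described above, under simplifying assumptions: the XOR-based encode_parity is a stand-in for a real MDS code such as RS or CRS, and all names are hypothetical rather than ESetStore's actual interfaces.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative stripe: K data blocks plus M parity blocks (N = K + M). */
    enum { K = 4, M = 2, N = K + M, BLOCK = 8 };

    /* Placeholder encode: every parity block is the XOR of all K data blocks.
     * A real system would use an MDS code (e.g., RS/CRS) here.               */
    static void encode_parity(unsigned char blk[N][BLOCK]) {
        for (int p = K; p < N; p++) {
            memset(blk[p], 0, BLOCK);
            for (int d = 0; d < K; d++)
                for (int b = 0; b < BLOCK; b++)
                    blk[p][b] ^= blk[d][b];
        }
    }

    int main(void) {
        unsigned char blk[N][BLOCK];
        unsigned char file[K * BLOCK];
        for (int i = 0; i < K * BLOCK; i++) file[i] = (unsigned char)i;

        /* Step 1: split the file into K equal-sized data blocks. */
        for (int d = 0; d < K; d++)
            memcpy(blk[d], file + d * BLOCK, BLOCK);

        /* Step 2: generate the M parity blocks of the stripe. */
        encode_parity(blk);

        /* Step 3: place block i on host i (one host per rack), so a single
         * disk, host, or rack failure costs the stripe at most one block.  */
        for (int i = 0; i < N; i++)
            printf("block %d -> host %d (rack %d)\n", i, i, i);
        return 0;
    }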

Erasure-coded storage systems confront many performance degradation factors. The first is that encoding and decoding operations are computation-intensive tasks, implying a long coding time. Another major issue is that recovery typically takes a long time due to heavy I/O operations: to recover one block, k blocks must be retrieved. This motivates us to present our work, which consists of optimizations to improve the performance of large-scale erasure-coded storage systems.

1.1 Notation and Nomenclature

In this section, we give an explanation of erasure codes, stripes, and the name of the reliability mechanism for in-memory storage.

Erasure Codes. In general, an erasure code is a mechanism to recover missing data. There are many types of erasure codes designed for special scenarios. In our thesis, we only consider the case of the aforementioned MDS codes [7]. The next paragraph gives the details of how they work.

We use three parameters to define an erasure code: n, k, and m, where n equals k+m. Erasure coding, which consists of encoding and decoding, covers the major operations of erasure codes. k equal size data chunks serve as the input for encoding. A whole set of n chunks serves as the output of encoding, where the whole input is included in the output. The remaining m chunks are called parity chunks. Any k distinct chunks can serve as the input of the decoding operation. The decoding operation can regenerate up to n chunks of the whole set.

Stripe. A stripe is the basic unit for managing erasure-coded data. We have mentioned the concept of a stripe in Fig. 1.1. Here we give a formal description of it. A stripe consists of n equal size blocks, where k blocks are data blocks and the remaining n-k blocks are parity blocks. An erasure-coded storage system can guarantee its data reliability as long as each stripe has at most n-k missing blocks. The data reliability

of a distributed system depends on the number of stripes and the data reliability of each stripe, which in turn depends on the values of n and k. Usually, there is a trade-off between the reliability and the storage overhead n/k.

RAIM. RAIM, an abbreviation for Redundant Array of Independent Memory, was recently introduced by IBM [53]. RAIM works in memory in a similar fashion to RAID on disk in order to tolerate a certain level of memory channel failures. In our thesis, we extend the concept of RAIM from a single physical server to a distributed system.

1.2 Erasure Coding

Erasure coding is crucial for the quality of service and user experience, as mentioned above. Modern data centers have begun to deploy high-speed Ethernet or even InfiniBand FDR/QDR/EDR to improve the network speed [6], and disk arrays based on Solid-State Drives (SSDs) to improve the disk input-output (I/O) performance [55]. This technology trend pushes the computationally expensive erasure coding into a potential performance bottleneck in erasure-coded storage systems.

Recently, graphics processing units (GPUs) have been used in some storage systems to perform different computationally expensive tasks. Shredder [5] is one framework for leveraging GPUs to efficiently chunk files for data deduplication and incremental storage. GPUstore [87] is another framework for integrating GPUs into storage systems for file-level or block-level encryption and RAID 6 data recovery. Another GPU-based RAID 6 system has been developed in [41], which uses GPUs to accelerate two RAID 6 coding schemes, namely the Blaum-Roth and Liberation codes. This system achieves a coding speed of up to 3GB/s. However, RAID 6 only supports up to two disk failures and is not suitable for large-scale systems. To date, Gibraltar [20], which employs table lookup operations to implement Galois field multiplications, is the most successful GPU-based Reed-Solomon coding library; notably, it is much faster than the single-threaded Jerasure

[70], which is the most popular erasure coding library for CPUs. PErasure [14] is a recent CRS coding library for GPUs and its performance is even better than Gibraltar's. However, PErasure does not fully utilize the GPU memory system and results in sub-optimal performance. With the rapid improvement of networking speed and aggregated disk I/O throughput, there is a demand to further improve the coding performance.

Data reliability and availability are critical requirements for data storage systems. Although coding-based RAID 5/6 have been the industry standards for decades, replication is still the de facto data protection solution in large-scale distributed storage systems. With the increase in the amounts of data and the deployment of expensive SSDs, there is a great opportunity for erasure coding because it can provide much lower storage overhead and higher reliability compared with data replication. In [47], a comprehensive comparison of erasure coding and replication was presented. However, erasure coding is a compute- and data-intensive task, which brings practical challenges to its adoption in distributed storage systems.

In an erasure-coded distributed storage system, there could be three potential performance bottlenecks: the aggregated disk I/O throughput, the network bandwidth, and the coding performance. Modern data centers have started to deploy high-speed network switches with more than 40Gb/s of bandwidth per network port [6]. Facebook and LinkedIn are already working on 100Gb/s networks for their data centers [37][100]. Meanwhile, the sequential I/O throughput of a single SSD has been improved to more than 4Gb/s [92], and the aggregated I/O throughput of a disk array can easily match the network bandwidth. However, the throughput of software-implemented erasure coding is inversely proportional to the number of parity chunks, and is typically less than 10 Gb/s on multi-core CPUs [14], which makes erasure coding impractical for large-scale distributed systems. Modern GPUs have tens of TFLOPS of computation power and an internal memory bandwidth of a few hundred GB/s, providing an opportunity to speed up erasure coding to saturate the disk I/O and network bandwidth. This motivates us to design and implement

G-CRS to fully utilize the GPU power and achieve a high coding throughput.

1.3 Recovery of Erasure-coded Storage Systems

Many recent studies have revealed that data centers frequently suffer disk-level [80][67] and host-level [63][31] failures [79][61][30][27]. This makes data recovery part of the daily work of storage systems. For example, the Facebook warehouse transfers around 100 TB of data each day to recover data from its failed disks and hosts [73].

The data placement scheme plays a critical role in erasure-coded storage systems. First of all, the data placement scheme should guarantee very high data reliability and availability that can withstand a certain level of disk failures, host failures, and rack failures. Secondly, it should achieve a desirable data recovery throughput. However, recovering a failed host or disk is very time consuming for erasure-coded storage systems [93]. Using the popular Reed-Solomon code as an example, recovering an individual data block requires fetching a total of k blocks. Consider a scenario in which a data center needs to recover a failed disk with 1 TB of data, and k = 10. Then it requires accessing 10 TB of data to recover the failed disk. Assuming the disk I/O throughput is 100 MB/s, the recovery time will be at least 10^5 seconds if the 10 TB of data blocks can only be accessed sequentially. To reduce the recovery time, it is imperative to distribute those 10 TB of data blocks among many different disks at the very beginning, so that we can exploit parallel disk I/O to speed up the data recovery process. Another approach to reducing data recovery time is to design new erasure codes that require less data to recover a failed data block. In this work, we focus on exploiting I/O parallelism.

A simple yet popular data placement scheme is random data placement, which distributes data and code blocks randomly. This makes storage systems able to aggregate I/O from more than k hosts or disks to recover a failed host or disk. However, the recovery performance may not always be guaranteed. For example, Facebook's f4 took two days to recover the data on a host because the data and code blocks for the

failed host were not well distributed [59]. Another side effect is that, during the recovery period (a.k.a. degraded read), service latency can become 10x that of normal mode and service quality decreases greatly. To guarantee recovery performance for each host, a data placement algorithm must ensure that enough hosts can participate in the recovery process. However, this aspect is neglected by existing studies. This motivates us to design a placement algorithm that introduces efficient recovery.

1.4 In-memory Storage

In recent years, key-value stores have been widely used in many commercial large-scale Web-based systems, including Amazon, Facebook, YouTube, Twitter, GitHub, and LinkedIn. In order to reduce the data access latency caused by disk I/O, an in-memory cache system is usually deployed between the front-end Web system and the back-end database system. For example, Facebook is using a very large distributed in-memory cache system built from the popular Memcached [26] [62], which consists of thousands of server nodes [58].

While in-memory storage can greatly improve the performance of a key-value store, there are optimizations to further improve its performance. LSM-trie is an in-memory key-value store for small data [97]. Some studies have also tried flash memory to provide similar performance while reducing the per-bit cost of storage [83] [84][96]. This indicates that performance is a key concern for in-memory storage.

In a large-scale in-memory key-value store system, node failures become very common [88], which may seriously affect the access latency and user experience. How to improve the reliability of the distributed cache system becomes an important issue. Redundancy techniques such as RAID [12] have been widely used in hard disk-based storage systems to offer fault tolerance. RAID-1 is basically the same as data replication, which achieves very good reliability but requires double the cost. RAID-5 and RAID-6 improve the storage efficiency at the expense of decreased access

performance when faced with disk failures. But currently, how these redundancy techniques work in memory is still unknown and worth studying.

1.5 Thesis Goals and Contributions

The goals of this thesis are the following:

1. We aim to improve the performance of the time-consuming encoding and decoding of erasure coding with the help of GPUs.

2. We study placement algorithms to provide reliable storage and improved recovery performance for erasure-coded storage systems.

3. We investigate the performance of applying coding techniques to achieve reliable in-memory storage.

4. We propose a prototype named ESetStore for erasure-coded storage systems. ESetStore brings efficient recovery with the integration of our data placement algorithm ESet.

The overall objective of this thesis is to improve the performance of large-scale erasure-coded storage systems. Towards this objective and the goals mentioned above, this thesis makes the following contributions:

We propose a graphics processing unit (GPU)-based implementation of erasure coding named G-CRS, which employs the Cauchy Reed-Solomon (CRS) code. We designed and implemented a set of optimization strategies, such as a compact structure to store the bitmatrix in GPU constant memory, efficient data access through shared memory, and decoding parallelism, to maximize coding performance by fully utilizing the GPU resources. In addition, we derived a simple yet accurate performance model to demonstrate the maximum coding performance of G-CRS on GPU. We evaluated the performance of G-CRS through extensive experiments on modern GPU architectures such as Maxwell

and Pascal, and compared it with other state-of-the-art coding libraries. The evaluation results revealed that the throughput of G-CRS was 10 times faster than that of most of the other coding libraries.

We present a data placement strategy named ESet which brings recovery efficiency to each host in a distributed storage system. We define a configurable parameter named the overlapping factor for system administrators to easily achieve desirable recovery I/O parallelism. Our simulation results show that ESet can significantly improve the data recovery performance without violating the reliability requirement by distributing data and code blocks across different failure domains.

We present the design, implementation, analysis, and evaluation of R-Memcached, a reliable in-memory key-value cache system that is built on top of the popular Memcached software. R-Memcached exploits coding techniques to achieve reliability, and can tolerate up to two node failures. Our experimental results show that R-Memcached can maintain very good latency and throughput performance even during periods of node failures.

We present the design, implementation and evaluation of ESetStore, a prototype for erasure-coded storage systems. ESetStore achieves good read and write performance. The evaluation demonstrates that ESetStore can bring efficient recovery with our designed data placement algorithm ESet.

1.6 Organization

The remainder of this thesis is organized as follows: Chapter 2 introduces the background and provides an overview of the related literature. Chapter 3 presents the design and implementation of our G-CRS at length. This chapter is based on joint work with Mr. WANG Qiang, Dr. CHU Xiaowen, Prof.

LEUNG Yiu-Wing and has been published in IEEE Transactions on Parallel and Distributed Systems in 2018 [51].

Chapter 4 studies the placement algorithm for erasure-coded storage systems. We demonstrate how our proposed ESet brings recovery efficiency to each host in an erasure-coded large-scale storage system. This chapter is based on joint work with Dr. LIU Hai, Dr. CHU Xiaowen, Prof. LEUNG Yiu-Wing and has been presented at the International Conference on Computer Communication and Networks in 2016 [48].

Chapter 5 investigates the performance of applying coding techniques to in-memory storage. R-Memcached consists of RAIM-1, RAIM-5, and RAIM-6. It can tolerate up to two node failures while maintaining good latency and throughput performance. This chapter is based on joint work with Dr. OUYANG Kai, Dr. LIU Hai, Dr. CHU Xiaowen, Prof. LEUNG Yiu-Wing and has been presented at the International Conference on Big Data Computing and Communications [50] and published in Tsinghua Science and Technology in 2015 [49].

Chapter 6 presents the design, implementation, and evaluation of ESetStore. The evaluation results demonstrate that ESetStore can achieve good read and write performance while providing fast data recovery.

Chapter 7 draws the conclusions and presents some future directions of our research.

Chapter 2

Background and Related Work

In this chapter, we present the background and some related work of our thesis. We first present the related literature on erasure coding. Then we introduce GPU computing. Finally, we discuss the recovery of erasure-coded storage and in-memory reliability.

2.1 Erasure Coding

Erasure coding is performed when writing data to storage systems and when recovering any missing data in storage systems. Thus its performance is crucial to an erasure-coded storage system. Over the past decades, many research works have been proposed to improve the performance of erasure coding. One pioneering study optimizes the Cauchy distribution matrix, which results in better coding performance for CRS coding on the CPU by performing fewer XOR operations [71]. Jerasure [70] is a popular library that implements various kinds of erasure codes on the CPU, including optimized CRS coding. Optimization with efficient scheduling of XOR operations on the CPU is presented in [52]. These sequential CRS algorithms are designed for CPUs and are not suitable for GPUs due to their complicated control flows. Another thread of research aims to exploit parallel computing techniques to speed up erasure coding. For multi-core CPUs, CRS codes have been parallelized in [42, 86]; EVENODD codes have been parallelized in [23];

and RDP codes have been parallelized in [24]. A fast Galois field arithmetic library for multi-core CPUs with support for Intel SIMD instructions has recently been presented in [68]. Although these works have achieved great improvements, their coding performance is still not comparable to the throughput of today's high-speed networks, especially when a large number of parity data chunks is required for higher data reliability. These parallel algorithms cannot be directly applied to GPUs either, due to the different hardware architectures. For many-core GPUs, the Gibraltar library [20] implements classical Reed-Solomon coding and outperforms many existing coding libraries on CPUs. PErasure [14] is a recent CRS coding library for GPUs and its performance is better than Gibraltar's. However, PErasure does not fully utilize the GPU memory system and results in sub-optimal performance.

2.2 GPU Computing

Modern GPUs are typically equipped with hundreds to thousands of processing cores evenly distributed over several streaming multiprocessors (SMs). For example, the Nvidia GTX 980 with the Maxwell architecture contains 16 SMs and 4GB of off-chip GDDR memory named global memory. Fig. 2.1 presents the layout of the streaming multiprocessors of the GTX 980 from [17]. Each SM has 128 Stream Processors (SPs, or cores) and a 96-KB on-chip memory named shared memory, which has much higher throughput and lower latency than the off-chip GDDR memory [54]. Besides the 2MB L2 cache shared by the whole GPU, each SM also has a small amount of on-chip cache to speed up data access to the read-only constant memory. Fig. 3.4 presents an abstraction of a GPU's memory hierarchy.

Currently, CUDA is the most popular programming model for GPUs [19]. A typical CUDA program comprises host functions, which are executed on the central processing unit (CPU), and kernel functions, which are executed on the GPU. Each kernel function runs as a grid of threads, which are organized into many equal sized thread blocks. Each thread block can include a set of threads distributed in a number

of thread warps, each of which has 32 threads that execute the same instruction at a time. Threads in a thread block can share data through their shared memory and perform barrier synchronization.

Figure 2.1: The Architecture of SMs in GTX 980 [17]
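The minimal CUDA sketch below (not taken from G-CRS) only illustrates the concepts just described: a kernel launched as a grid of thread blocks, per-block shared memory, and barrier synchronization with __syncthreads(). The kernel name and the XOR workload are arbitrary choices for illustration.

    #include <cuda_runtime.h>
    #include <stdio.h>

    // Each block stages its slice of the input in shared memory, synchronizes,
    // then each thread XORs its word with its neighbour's word.
    __global__ void xor_neighbours(const long *in, long *out, int n) {
        extern __shared__ long tile[];              // per-block shared memory
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        if (gid >= n) return;
        tile[threadIdx.x] = in[gid];                // load into shared memory
        __syncthreads();                            // barrier within the block
        long next = (threadIdx.x + 1 < blockDim.x && gid + 1 < n)
                        ? tile[threadIdx.x + 1] : 0;
        out[gid] = tile[threadIdx.x] ^ next;        // simple XOR computation
    }

    int main(void) {
        const int n = 1 << 20;
        long *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(long));
        cudaMalloc(&d_out, n * sizeof(long));
        cudaMemset(d_in, 0x5a, n * sizeof(long));

        int threads = 256;                          // threads per block (8 warps)
        int blocks = (n + threads - 1) / threads;   // blocks in the grid
        xor_neighbours<<<blocks, threads, threads * sizeof(long)>>>(d_in, d_out, n);
        cudaDeviceSynchronize();
        printf("launched %d blocks of %d threads\n", blocks, threads);
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }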

Modern GPUs are embedded with hundreds to thousands of arithmetic processing units that provide tremendous computing power, which attracts much work to port computationally intensive applications from CPUs to GPUs. For example, G-BLASTN [102], a nucleotide alignment tool for the GPU, achieves more than 10 times performance improvement over its CPU version NCBI-BLAST. The acceleration of dynamic graph analytics is presented in [81]. In [91], regular expression matching on GPUs is shown to be 48 times faster than on CPUs. A high performance key-value store with GPUs is presented in [101]. SSLShader [36] and AES [44] both demonstrate great performance improvements when data encryption algorithms are offloaded to GPUs. Network coding on GPUs, such as that in [15], [13] and Nuclei [85], is the most closely related work in addition to the aforementioned Gibraltar [20] and PErasure [14]. These studies aim to improve the performance of network coding to match the throughput of high-speed networks for both encoding and decoding.

2.3 Recovery of Erasure-Coded Storage

Recovery is a time-consuming task in an erasure-coded storage system due to intensive I/O operations, as k chunks must be retrieved to recover a failed chunk. Researchers have tried to optimize recovery from two aspects. One is to reduce the I/O operations performed during recovery. The other is to make better utilization of the available I/O resources.

Reducing I/O operations is a commonly used method for improving recovery performance. A replace recovery algorithm is proposed in [103] to reduce I/O operations in XOR-based erasure codes. A solution to reduce the symbols required for recovery is proposed for the RDP code [98]. Hitchhiker [74] is a coding mechanism evolved from the family of regeneration codes [75] that reduces the network traffic and disk I/O by around 25% to 45% during the reconstruction process.

There will be k storage servers involved in recovering a missing chunk or a failed storage server. Partial Parallel Repair is a technique for better utilizing these storage servers

for reconstruction with a pipelined I/O mechanism [57]. ECPipe [45] is a pipelined I/O mechanism for improving recovery performance in a heterogeneous environment by effectively utilizing the available I/O resources.

The essential goal of a storage system is to store data and provide quality of service for data access. When a storage system scales to more than thousands of hosts and tens of thousands of disks, failures are daily events. To prevent data loss caused by failures, data recovery is performed every day. During the data recovery process, accesses to the lost data (called degraded reads) suffer long latency due to data decoding operations. This makes recovery performance play a crucial role in keeping the quality of service of storage systems. A report [78] revealed that after a major failure event, some 60% of affected systems need more than 4 hours to restore their service due to the necessary recovery, and around 20% need more than a whole day to come back to normal service. This indicates that a storage system must take recovery performance into consideration when optimizing its data layout; otherwise the system service quality would be heavily damaged.

Recovery performance is an essential design goal for data placement algorithms [1]. Consider a storage system with m disks: when a disk fails, the recovery performance is optimal if the other m-1 disks can participate in its recovery and each disk contributes equal I/O for recovering the failed disk. For replication-based storage systems, Steiner systems can be applied to achieve optimal recovery performance for a certain number of hosts. The optimal solution for recovery is achieved when some conditions are satisfied [43]. However, it only works with some special numbers of hosts [40][34]. Thus, researchers are trying to design near-optimal Steiner systems. A near-optimal parallelism is proposed in [1] for storage systems with a few disks. In [82], a data placement algorithm is given for replication to obtain optimal parallelism across disks. Copysets [16] addresses the issue of data loss for replication-based storage systems by designing near-optimal Steiner systems. It uses the scatter width to represent how much I/O parallelism a host can obtain for its recovery. It selects hosts from

different racks to form a group to avoid concurrent failures in the same group. Some permutations are made within the group so that each host distributes its replicas across the hosts in the group, and thus each host can obtain near-optimal recovery performance. When a host fails, all other hosts may contribute nearly equal I/O to recover the failed host. However, Copysets is not directly applicable to erasure-coded storage systems, because to restore a data block, only one host is needed for replication, but k hosts are required for erasure coding.

Random data placement algorithms, such as RUSH [32][33], CRUSH [95], and Random Slicing [56], can work with both replication-based and erasure-coded storage systems. These algorithms can guarantee reliability by randomly placing replicas across all failure domains, where each failure domain contains a set of disks or hosts that may become unavailable when a shared component fails. But they mainly focus on how to distribute data evenly in large-scale storage systems. Distributing data to obtain good recovery performance for each host is overlooked. Intuitively, the number of participating hosts is large enough to achieve good recovery performance. However, due to the randomness, the number of hosts participating in recovering a failed host cannot be configured, and the recovery performance of each host may not satisfy the requirement of the storage system. Some storage systems map each host to many virtual storage nodes to accelerate a host's recovery: when a host fails, all other hosts can participate in its recovery. However, this makes storage systems more likely to suffer data loss when more than n-k disks fail concurrently.

In summary, existing data placement algorithms cannot provide efficient recovery for erasure-coded storage systems while guaranteeing system reliability, which would further jeopardize the service quality and reliability of a storage system. This motivates our work to design a placement algorithm for large-scale erasure-coded storage systems that brings efficient recovery performance and ensures system reliability.
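To connect the recovery I/O parallelism discussed here with the arithmetic of Section 1.3, the short C sketch below estimates the time to rebuild a failed disk as a function of the amount of lost data, the code parameter k, the per-disk throughput, and the number of disks that can serve recovery reads in parallel. It is only a back-of-the-envelope model under the stated assumptions (uniform load, disks rather than the network as the bottleneck), not a formula taken from the thesis, and the names are illustrative.

    #include <stdio.h>

    /* Back-of-the-envelope recovery time estimate (illustrative only):
     * lost_bytes : data on the failed disk
     * k          : blocks read per recovered block (Reed-Solomon style)
     * disk_bw    : sustained throughput of one disk, bytes/s
     * readers    : disks that can serve recovery reads in parallel
     * Assumes the k * lost_bytes of reads are spread evenly over the readers. */
    static double recovery_seconds(double lost_bytes, int k,
                                   double disk_bw, int readers) {
        double total_read = (double)k * lost_bytes;
        return total_read / (disk_bw * readers);
    }

    int main(void) {
        double tb = 1e12, mb = 1e6;
        /* Sequential case from Section 1.3: 1 TB lost, k = 10, 100 MB/s, 1 reader. */
        printf("sequential:  %.0f s\n", recovery_seconds(tb, 10, 100 * mb, 1));
        /* With 100 disks serving recovery reads in parallel. */
        printf("100 readers: %.0f s\n", recovery_seconds(tb, 10, 100 * mb, 100));
        return 0;
    }

With a single sequential reader this reproduces the 10^5 seconds mentioned in Section 1.3; spreading the same reads over 100 disks brings the estimate down to about 1,000 seconds, which is exactly the kind of parallelism a placement algorithm must guarantee.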

2.4 Reliability of In-memory Storage

Storing metadata in main memory is key to improving system performance, as in Facebook's storage system Haystack [4]. There are also other storage systems that store entire datasets in memory to provide high performance storage. Storing data in memory greatly reduces access latency and improves throughput compared with disk-based storage. Examples of in-memory storage include Memcached [25], Redis [76], and RAMCloud [64]. Memcached is an in-memory cache system used to reduce data access latency for key-value stores. It has been employed by Facebook to improve its Web service performance. Redis is also an in-memory key-value store with disks as a backup for persistent storage. RAMCloud offers low-latency access to large-scale datasets as it stores all data in DRAM. However, these systems face various kinds of trade-offs when applying reliability mechanisms to offer reliable in-memory storage.

A straightforward and easy-to-implement reliability mechanism is to write data to disk to prevent data loss against any kind of failure. This approach is adopted by systems like Redis [76]: it writes data to disk as a backup when storing data in memory, and when performing data recovery, the data is reloaded from disk into memory. This approach can greatly save storage cost, as disk storage is quite cheap compared with in-memory storage. However, writing the backup to disk ensures reliability at the cost of a huge performance penalty when encountering failures [4]. RAMCloud keeps a copy of each object in memory, which doubles the storage overhead of memory storage [63] and increases the storage cost greatly. As a consequence, coding techniques have been studied to ensure reliability with lower storage cost while reducing the performance penalty caused by failures [11] [72] [99].

In summary, the expensive per-bit cost of memory makes reliable in-memory storage a trade-off: it needs to reduce storage cost while maintaining access performance at a certain level. This motivates us to study the performance degradation of applying coding techniques to in-memory storage.

To this end, we design and implement G-CRS, a GPU-accelerated

Cauchy Reed-Solomon coding library that can match the performance of disk I/O and network bandwidth. We design a data placement algorithm named ESet to improve the recovery performance of erasure-coded storage systems to the desired level. We design and implement R-Memcached to apply coding techniques to in-memory storage; this work can serve as a reference for choosing a proper coding scheme for in-memory storage. Last but not least, we design and implement ESetStore, a prototype erasure-coded storage system with fast data recovery.

Chapter 3

G-CRS: GPU Accelerated Cauchy Reed-Solomon Coding

In this chapter, we present a graphics processing unit (GPU)-based implementation of erasure coding named G-CRS, which employs the Cauchy Reed-Solomon (CRS) code, to improve erasure coding performance. To maximize the coding performance of G-CRS, we designed and implemented a set of optimization strategies, such as a compact structure to store the bitmatrix in GPU constant memory, efficient data access through shared memory, and decoding parallelism, to fully utilize the GPU resources. In addition, we derived a simple yet accurate performance model to demonstrate the maximum coding performance of G-CRS on GPU. We evaluated the performance of G-CRS through extensive experiments on modern GPU architectures such as Maxwell and Pascal, and compared it with other state-of-the-art coding libraries. The evaluation results revealed that the throughput of G-CRS was 10 times faster than that of most of the other coding libraries. Moreover, G-CRS outperformed PErasure (a recently developed, well-optimized CRS coding library for the GPU) by up to 3 times on the same architecture.

3.1 Introduction

In this chapter, we present the design of a new CRS coding library for GPUs, namely G-CRS, that can fully utilize the GPU resources and deliver high coding performance that saturates state-of-the-art network speeds. To this end, we have designed new data structures and a set of optimization strategies. G-CRS can achieve more than 50GB/s of raw coding performance on a modern Nvidia Titan X GPU for the case of m = 16 (i.e., the system can withstand up to 16 device failures).^1 For the optimization of G-CRS, we present a step-by-step optimization analysis to reveal how to utilize GPUs to accelerate CRS coding. We believe our work can be beneficial to other algorithms and applications with similar data access and computational patterns. We propose a simple yet accurate performance model to understand the major factors that affect the coding performance. A pipelined mechanism is investigated to enable our G-CRS to achieve peak performance by efficiently overlapping data copy operations and coding operations.

^1 The source code and experimental results of our G-CRS are available at hkbu.edu.hk/~chxw/gcrs.html.

3.2 Cauchy Reed-Solomon Coding

Figure 3.1: Illustration of CRS encoding. d_i^T is the w-row format of data chunk d_i. c_i^T is the w-row format of parity chunk c_i.

As illustrated in Fig. 3.1, the encoding procedure of CRS takes k equal sized data chunks d_0, d_1, ..., d_{k-1} as input, and generates m parity chunks c_0, c_1, ..., c_{m-1} as output. To perform the coding, it needs to select an integer parameter w that is no less than log_2(k + m). Hence a CRS code can be defined by a triple (k, m, w). CRS first defines an m × k Cauchy distribution matrix over the Galois Field GF(2^w), and then expands it into a (k + m)w × kw generator matrix over GF(2) whose elements are either 1 or 0 [8]. Notice that the top kw rows of the generator matrix form an identity matrix. Each data chunk d_i needs to be transformed into w rows, denoted by D_{i,0}, D_{i,1}, ..., D_{i,w-1}. The w-row format of data chunk d_i is denoted by d_i^T. Then all the k data chunks can be combined into a data matrix with kw rows. By multiplying the generator matrix and the data matrix over GF(2), we can get k + m output chunks that include the k original data chunks (due to the kw × kw identity sub-matrix) and the m parity chunks. Notice that the multiplication in GF(2) can be implemented by efficient bit-wise XOR operations, which is the major property of CRS.

Figure 3.2: A concrete example of CRS encoding. k = 2, m = 2, w = 2.

Fig. 3.2 presents a concrete example of the CRS encoding process where k = 2, m = 2 and w = 2. Data chunk d_i consists of two rows: D_{i,0} and D_{i,1}, i = 0, 1. In the actual encoding process, we only need to calculate the m parity chunks using the bottom mw rows of the generator matrix. For each parity chunk, the values in the corresponding row vector of the generator matrix determine which data chunks will be involved in the XOR operations. For example, the fifth row vector of the generator matrix in Fig. 3.2, <1, 1, 0, 1>, determines that C_{0,0} is generated from D_{0,0}, D_{0,1} and D_{1,1} with XOR operations.
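To make the bitmatrix-driven XOR procedure concrete, the following CPU-side C sketch encodes m parity chunks from k data chunks using the bottom mw rows of the generator matrix, row by row as described above. It is a minimal single-threaded illustration, not G-CRS itself and not the Jerasure API; the packet layout and names are chosen for brevity.

    #include <string.h>

    /* Minimal CPU-side CRS-style encoder (illustrative only).
     * bitmatrix: mw x kw matrix of 0/1 ints (bottom rows of the generator matrix)
     * data:      k chunks, each w packets of psize bytes (packet i*w + r is row r of chunk i)
     * parity:    m chunks, each w packets of psize bytes, produced as output              */
    static void crs_encode_cpu(const int *bitmatrix,
                               const unsigned char *data, unsigned char *parity,
                               int k, int m, int w, size_t psize) {
        for (int row = 0; row < m * w; row++) {             /* one parity packet per row */
            unsigned char *out = parity + (size_t)row * psize;
            memset(out, 0, psize);
            for (int col = 0; col < k * w; col++) {
                if (bitmatrix[row * k * w + col] == 1) {    /* this data packet participates */
                    const unsigned char *in = data + (size_t)col * psize;
                    for (size_t b = 0; b < psize; b++)
                        out[b] ^= in[b];                    /* GF(2) multiply-add is XOR */
                }
            }
        }
    }

For the example in Fig. 3.2, row 0 of the bitmatrix (the fifth row of the full generator matrix) is <1, 1, 0, 1>, so the first parity packet C_{0,0} comes out as the XOR of D_{0,0}, D_{0,1} and D_{1,1}, matching the description above.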

When either a data chunk or a parity chunk becomes unavailable due to a device failure, a decoding operation is triggered to restore the missing chunk. To recover a parity chunk, w row vectors from the generator matrix together with all data chunks serve as the input. By contrast, to recover a data chunk, an inverse matrix is generated from the generator matrix (which can be done offline in advance), and k alive chunks serve as the input together with w row vectors from the inverse matrix. The encoding and decoding operations are essentially the same in terms of data access pattern and computation. Therefore, we use the term coding to represent both encoding and decoding.

3.3 Design of G-CRS

In this section, we first present a high-level view of G-CRS and define our terminology. Next, we provide a baseline implementation of CRS coding on GPUs that is directly migrated from the CPU version. Subsequently, we analyze the potential drawbacks of this basic design and provide a set of optimization strategies to accelerate CRS coding on GPUs. Our G-CRS is implemented by applying all the optimization strategies described in this section.

Fig. 3.3 illustrates the system architecture of G-CRS, which implements a (k, m, w) CRS code. The bitmatrix stores the bottom mw rows of the generator matrix of the CRS code. The input data is divided into k equal sized data chunks, and the output includes m parity chunks of the same size. We use s to represent the number of bytes of a long data type on the target hardware platform, i.e., s = sizeof(long). We define a packet as s consecutive bytes. The XOR of two packets can then be efficiently carried out by a single instruction. We define a data block as w consecutive packets, where w is the parameter of CRS and should be no less than log_2(k + m). The number of data blocks in a chunk is denoted by N. We summarize the high-level workflow of G-CRS in Algorithm 1. When the size of user data is greater than the available GPU memory, we will encode the data in different rounds.
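Algorithm 1 itself is not reproduced in this transcription. The CUDA host-side sketch below only illustrates the multi-round idea just stated: when the user data exceeds the available GPU memory, the host copies one round of data to the device, launches the coding kernel, and copies the results back, round by round. All names (coding_kernel_stub, encode_in_rounds, round_bytes) are illustrative placeholders rather than G-CRS's actual interfaces, and the stub copies as much output as input, whereas the real encoder produces m parity chunks per k data chunks.

    #include <cuda_runtime.h>
    #include <stddef.h>

    // Placeholder kernel: stands in for the real CRS coding kernel.
    __global__ void coding_kernel_stub(const long *in, long *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];   // real code would XOR per the bitmatrix
    }

    // Illustrative multi-round host loop: process 'total' bytes of user data
    // in rounds of at most 'round_bytes', so the working set fits in GPU memory.
    void encode_in_rounds(const long *host_in, long *host_out,
                          size_t total, size_t round_bytes) {
        long *d_in, *d_out;
        cudaMalloc(&d_in, round_bytes);
        cudaMalloc(&d_out, round_bytes);

        for (size_t off = 0; off < total; off += round_bytes) {
            size_t bytes = (total - off < round_bytes) ? total - off : round_bytes;
            int n = (int)(bytes / sizeof(long));

            cudaMemcpy(d_in, (const char *)host_in + off, bytes, cudaMemcpyHostToDevice);
            coding_kernel_stub<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
            cudaMemcpy((char *)host_out + off, d_out, bytes, cudaMemcpyDeviceToHost);
        }
        cudaFree(d_in);
        cudaFree(d_out);
    }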

1  __global__ void crs_coding_kernel_naive
2  (long *in, long *out, int *bitmatrix,
3   int size, int k, int m, int w) {
4      int blockunits = blockDim.x / w;
5      int blockpackets = blockunits * w;
6      int tid = threadIdx.x + blockIdx.x * blockpackets;
7      int unit_id_offset = tid / w * w;
8      int unit_in_id = tid % w, i, j;
9
10     if (threadIdx.x >= blockpackets) return;
11     if (tid >= size) return;
12
13     int index = threadIdx.y * k * w * w + unit_in_id * k * w;
14     long result = 0, input;
15
16     for (i = 0; i < k; ++i) {
17         for (j = 0; j < w; ++j) {
18             if (bitmatrix[index] == 1) {
19                 input = *(in + size * i + unit_id_offset + j);
20                 result = result ^ input;
21             }
22             ++index;
23         }
24     }
25
26     *(out + size * threadIdx.y + tid) = result;
27 }

Listing 1: A baseline implementation of CRS coding on GPU

3.3.1 Baseline Implementation

Listing 1 presents the baseline implementation of the CRS coding kernel written in CUDA, which is directly migrated from a CPU version. The kernel function

defines the behavior of a single GPU thread. In this implementation, each thread is responsible for encoding a single packet. When the kernel function is launched, a total of mwN threads will be created to generate the m parity chunks in parallel. Each element of the bitmatrix is represented by an integer. This kernel function works as follows: (1) The input buffer in and the output buffer out are located in the GPU global memory. (2) The mw bottom row vectors of the generator matrix are stored in the bitmatrix, which is also located in global memory. (3) Each thread calculates the initial index of its assigned packet in each data block (line 13). Then, each thread iterates over its corresponding row in the bitmatrix to determine the data required to perform the XOR operations (lines 16-24). (4) Each thread writes a packet, which is stored in the variable result, to the output buffer (line 26).

Figure 3.3: Input and Output for Coding

Some severe performance penalties exist in this baseline implementation, implying that the GPU resources are substantially under-utilized. To design a fully optimized version of CRS coding on GPU, a thorough understanding of the GPU architecture is required, including its memory subsystem and its method of handling branch divergence. We have identified three major performance penalties in the baseline implementation:

Inefficient memory access: The memory access pattern of the baseline kernel implementation in global memory causes a considerable performance penalty.
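As general background (not the thesis's own analysis of Listing 1), the usual reason a global-memory access pattern hurts GPU performance is a lack of coalescing: when the 32 threads of a warp touch 32 consecutive words, the hardware merges them into a few wide transactions, whereas scattered or strided accesses spread over many memory segments. The two toy CUDA kernels below contrast the patterns; kernel names and sizes are arbitrary choices for illustration, and profiling them typically shows the strided variant reaching only a fraction of the coalesced bandwidth, which motivates optimizing how coding threads map onto packets.

    #include <cuda_runtime.h>

    // Coalesced: consecutive threads of a warp read/write consecutive words.
    __global__ void copy_coalesced(const long *in, long *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: each thread jumps by 'stride' words, scattering the warp's accesses.
    __global__ void copy_strided(const long *in, long *out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }

    int main(void) {
        const int n = 1 << 20, stride = 32;
        long *d_in, *d_out;
        cudaMalloc(&d_in, (size_t)n * stride * sizeof(long));
        cudaMalloc(&d_out, (size_t)n * stride * sizeof(long));
        copy_coalesced<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        copy_strided<<<(n + 255) / 256, 256>>>(d_in, d_out, n * stride, stride);
        cudaDeviceSynchronize();
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }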


More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

Erasure coding and AONT algorithm selection for Secure Distributed Storage. Alem Abreha Sowmya Shetty

Erasure coding and AONT algorithm selection for Secure Distributed Storage. Alem Abreha Sowmya Shetty Erasure coding and AONT algorithm selection for Secure Distributed Storage Alem Abreha Sowmya Shetty Secure Distributed Storage AONT(All-Or-Nothing Transform) unkeyed transformation φ mapping a sequence

More information

Efficiency Considerations of Cauchy Reed-Solomon Implementations on Accelerator and Multi-Core Platforms

Efficiency Considerations of Cauchy Reed-Solomon Implementations on Accelerator and Multi-Core Platforms Efficiency Considerations of Cauchy Reed-Solomon Implementations on Accelerator and Multi-Core Platforms SAAHPC June 15 2010 Knoxville, TN Kathrin Peter Sebastian Borchert Thomas Steinke Zuse Institute

More information

Exploration of Erasure-Coded Storage Systems for High Performance, Reliability, and Inter-operability

Exploration of Erasure-Coded Storage Systems for High Performance, Reliability, and Inter-operability Virginia Commonwealth University VCU Scholars Compass Theses and Dissertations Graduate School 2016 Exploration of Erasure-Coded Storage Systems for High Performance, Reliability, and Inter-operability

More information

SRCMap: Energy Proportional Storage using Dynamic Consolidation

SRCMap: Energy Proportional Storage using Dynamic Consolidation SRCMap: Energy Proportional Storage using Dynamic Consolidation By: Akshat Verma, Ricardo Koller, Luis Useche, Raju Rangaswami Presented by: James Larkby-Lahet Motivation storage consumes 10-25% of datacenter

More information

Modern Erasure Codes for Distributed Storage Systems

Modern Erasure Codes for Distributed Storage Systems Modern Erasure Codes for Distributed Storage Systems Srinivasan Narayanamurthy (Srini) NetApp Everything around us is changing! r The Data Deluge r Disk capacities and densities are increasing faster than

More information

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload. Dror Goldenberg VP Software Architecture Mellanox Technologies

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload. Dror Goldenberg VP Software Architecture Mellanox Technologies Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Dror Goldenberg VP Software Architecture Mellanox Technologies SNIA Legal Notice The material contained in this tutorial is

More information

ActiveScale Erasure Coding and Self Protecting Technologies

ActiveScale Erasure Coding and Self Protecting Technologies NOVEMBER 2017 ActiveScale Erasure Coding and Self Protecting Technologies BitSpread Erasure Coding and BitDynamics Data Integrity and Repair Technologies within The ActiveScale Object Storage System Software

More information

File systems CS 241. May 2, University of Illinois

File systems CS 241. May 2, University of Illinois File systems CS 241 May 2, 2014 University of Illinois 1 Announcements Finals approaching, know your times and conflicts Ours: Friday May 16, 8-11 am Inform us by Wed May 7 if you have to take a conflict

More information

I/O CANNOT BE IGNORED

I/O CANNOT BE IGNORED LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.

More information

Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions. James S. Plank USENIX FAST. University of Tennessee

Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions. James S. Plank USENIX FAST. University of Tennessee Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions James S. Plank University of Tennessee USENIX FAST San Jose, CA February 15, 2013. Authors Jim Plank Tennessee Kevin Greenan EMC/Data

More information

Parallel Approach for Implementing Data Mining Algorithms

Parallel Approach for Implementing Data Mining Algorithms TITLE OF THE THESIS Parallel Approach for Implementing Data Mining Algorithms A RESEARCH PROPOSAL SUBMITTED TO THE SHRI RAMDEOBABA COLLEGE OF ENGINEERING AND MANAGEMENT, FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

Copysets: Reducing the Frequency of Data Loss in Cloud Storage

Copysets: Reducing the Frequency of Data Loss in Cloud Storage Copysets: Reducing the Frequency of Data Loss in Cloud Storage Asaf Cidon, Stephen M. Rumble, Ryan Stutsman, Sachin Katti, John Ousterhout and Mendel Rosenblum Stanford University cidon@stanford.edu, {rumble,stutsman,skatti,ouster,mendel}@cs.stanford.edu

More information

SYSTEM UPGRADE, INC Making Good Computers Better. System Upgrade Teaches RAID

SYSTEM UPGRADE, INC Making Good Computers Better. System Upgrade Teaches RAID System Upgrade Teaches RAID In the growing computer industry we often find it difficult to keep track of the everyday changes in technology. At System Upgrade, Inc it is our goal and mission to provide

More information

Correlation based File Prefetching Approach for Hadoop

Correlation based File Prefetching Approach for Hadoop IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie

More information

HKBU Institutional Repository

HKBU Institutional Repository Hong Kong Baptist University HKBU Institutional Repository Open Access Theses and Dissertations Electronic Theses and Dissertations 8-29-2016 Energy conservation techniques for GPU computing Xinxin Mei

More information

Virtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili

Virtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili Virtual Memory Lecture notes from MKP and S. Yalamanchili Sections 5.4, 5.5, 5.6, 5.8, 5.10 Reading (2) 1 The Memory Hierarchy ALU registers Cache Memory Memory Memory Managed by the compiler Memory Managed

More information

Isilon: Raising The Bar On Performance & Archive Use Cases. John Har Solutions Product Manager Unstructured Data Storage Team

Isilon: Raising The Bar On Performance & Archive Use Cases. John Har Solutions Product Manager Unstructured Data Storage Team Isilon: Raising The Bar On Performance & Archive Use Cases John Har Solutions Product Manager Unstructured Data Storage Team What we ll cover in this session Isilon Overview Streaming workflows High ops/s

More information

Parallelizing Inline Data Reduction Operations for Primary Storage Systems

Parallelizing Inline Data Reduction Operations for Primary Storage Systems Parallelizing Inline Data Reduction Operations for Primary Storage Systems Jeonghyeon Ma ( ) and Chanik Park Department of Computer Science and Engineering, POSTECH, Pohang, South Korea {doitnow0415,cipark}@postech.ac.kr

More information

The Google File System (GFS)

The Google File System (GFS) 1 The Google File System (GFS) CS60002: Distributed Systems Antonio Bruto da Costa Ph.D. Student, Formal Methods Lab, Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur 2 Design constraints

More information

Efficient Lists Intersection by CPU- GPU Cooperative Computing

Efficient Lists Intersection by CPU- GPU Cooperative Computing Efficient Lists Intersection by CPU- GPU Cooperative Computing Di Wu, Fan Zhang, Naiyong Ao, Gang Wang, Xiaoguang Liu, Jing Liu Nankai-Baidu Joint Lab, Nankai University Outline Introduction Cooperative

More information

Decentralized Distributed Storage System for Big Data

Decentralized Distributed Storage System for Big Data Decentralized Distributed Storage System for Big Presenter: Wei Xie -Intensive Scalable Computing Laboratory(DISCL) Computer Science Department Texas Tech University Outline Trends in Big and Cloud Storage

More information

RAMCube: Exploiting Network Proximity for RAM-Based Key-Value Store

RAMCube: Exploiting Network Proximity for RAM-Based Key-Value Store RAMCube: Exploiting Network Proximity for RAM-Based Key-Value Store Yiming Zhang, Rui Chu @ NUDT Chuanxiong Guo, Guohan Lu, Yongqiang Xiong, Haitao Wu @ MSRA June, 2012 1 Background Disk-based storage

More information

Reducing The De-linearization of Data Placement to Improve Deduplication Performance

Reducing The De-linearization of Data Placement to Improve Deduplication Performance Reducing The De-linearization of Data Placement to Improve Deduplication Performance Yujuan Tan 1, Zhichao Yan 2, Dan Feng 2, E. H.-M. Sha 1,3 1 School of Computer Science & Technology, Chongqing University

More information

Performance of relational database management

Performance of relational database management Building a 3-D DRAM Architecture for Optimum Cost/Performance By Gene Bowles and Duke Lambert As systems increase in performance and power, magnetic disk storage speeds have lagged behind. But using solidstate

More information

Distributed File Systems II

Distributed File Systems II Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation

More information

Ambry: LinkedIn s Scalable Geo- Distributed Object Store

Ambry: LinkedIn s Scalable Geo- Distributed Object Store Ambry: LinkedIn s Scalable Geo- Distributed Object Store Shadi A. Noghabi *, Sriram Subramanian +, Priyesh Narayanan +, Sivabalan Narayanan +, Gopalakrishna Holla +, Mammad Zadeh +, Tianwei Li +, Indranil

More information

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,

More information

B. Tech. Project Second Stage Report on

B. Tech. Project Second Stage Report on B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic

More information

VMware vsphere Clusters in Security Zones

VMware vsphere Clusters in Security Zones SOLUTION OVERVIEW VMware vsan VMware vsphere Clusters in Security Zones A security zone, also referred to as a DMZ," is a sub-network that is designed to provide tightly controlled connectivity to an organization

More information

Dense matching GPU implementation

Dense matching GPU implementation Dense matching GPU implementation Author: Hailong Fu. Supervisor: Prof. Dr.-Ing. Norbert Haala, Dipl. -Ing. Mathias Rothermel. Universität Stuttgart 1. Introduction Correspondence problem is an important

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 Advance Encryption Standard (AES) Rijndael algorithm is symmetric block cipher that can process data blocks of 128 bits, using cipher keys with lengths of 128, 192, and 256

More information

IBM Spectrum NAS, IBM Spectrum Scale and IBM Cloud Object Storage

IBM Spectrum NAS, IBM Spectrum Scale and IBM Cloud Object Storage IBM Spectrum NAS, IBM Spectrum Scale and IBM Cloud Object Storage Silverton Consulting, Inc. StorInt Briefing 2017 SILVERTON CONSULTING, INC. ALL RIGHTS RESERVED Page 2 Introduction Unstructured data has

More information

Design Tradeoffs for Data Deduplication Performance in Backup Workloads

Design Tradeoffs for Data Deduplication Performance in Backup Workloads Design Tradeoffs for Data Deduplication Performance in Backup Workloads Min Fu,DanFeng,YuHua,XubinHe, Zuoning Chen *, Wen Xia,YuchengZhang,YujuanTan Huazhong University of Science and Technology Virginia

More information

Data Center Performance

Data Center Performance Data Center Performance George Porter CSE 124 Feb 15, 2017 *Includes material taken from Barroso et al., 2013, UCSD 222a, and Cedric Lam and Hong Liu (Google) Part 1: Partitioning work across many servers

More information

Database Architecture 2 & Storage. Instructor: Matei Zaharia cs245.stanford.edu

Database Architecture 2 & Storage. Instructor: Matei Zaharia cs245.stanford.edu Database Architecture 2 & Storage Instructor: Matei Zaharia cs245.stanford.edu Summary from Last Time System R mostly matched the architecture of a modern RDBMS» SQL» Many storage & access methods» Cost-based

More information

SolidFire and Pure Storage Architectural Comparison

SolidFire and Pure Storage Architectural Comparison The All-Flash Array Built for the Next Generation Data Center SolidFire and Pure Storage Architectural Comparison June 2014 This document includes general information about Pure Storage architecture as

More information

GFS: The Google File System. Dr. Yingwu Zhu

GFS: The Google File System. Dr. Yingwu Zhu GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can

More information

vsan Security Zone Deployment First Published On: Last Updated On:

vsan Security Zone Deployment First Published On: Last Updated On: First Published On: 06-14-2017 Last Updated On: 11-20-2017 1 1. vsan Security Zone Deployment 1.1.Solution Overview Table of Contents 2 1. vsan Security Zone Deployment 3 1.1 Solution Overview VMware vsphere

More information

D E N A L I S T O R A G E I N T E R F A C E. Laura Caulfield Senior Software Engineer. Arie van der Hoeven Principal Program Manager

D E N A L I S T O R A G E I N T E R F A C E. Laura Caulfield Senior Software Engineer. Arie van der Hoeven Principal Program Manager 1 T HE D E N A L I N E X T - G E N E R A T I O N H I G H - D E N S I T Y S T O R A G E I N T E R F A C E Laura Caulfield Senior Software Engineer Arie van der Hoeven Principal Program Manager Outline Technology

More information

CONFIGURATION GUIDE WHITE PAPER JULY ActiveScale. Family Configuration Guide

CONFIGURATION GUIDE WHITE PAPER JULY ActiveScale. Family Configuration Guide WHITE PAPER JULY 2018 ActiveScale Family Configuration Guide Introduction The world is awash in a sea of data. Unstructured data from our mobile devices, emails, social media, clickstreams, log files,

More information

DiskReduce: Making Room for More Data on DISCs. Wittawat Tantisiriroj

DiskReduce: Making Room for More Data on DISCs. Wittawat Tantisiriroj DiskReduce: Making Room for More Data on DISCs Wittawat Tantisiriroj Lin Xiao, Bin Fan, and Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University GFS/HDFS Triplication GFS & HDFS triplicate

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

Survey: Users Share Their Storage Performance Needs. Jim Handy, Objective Analysis Thomas Coughlin, PhD, Coughlin Associates

Survey: Users Share Their Storage Performance Needs. Jim Handy, Objective Analysis Thomas Coughlin, PhD, Coughlin Associates Survey: Users Share Their Storage Performance Needs Jim Handy, Objective Analysis Thomas Coughlin, PhD, Coughlin Associates Table of Contents The Problem... 1 Application Classes... 1 IOPS Needs... 2 Capacity

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

Facilitating Magnetic Recording Technology Scaling for Data Center Hard Disk Drives through Filesystem-level Transparent Local Erasure Coding

Facilitating Magnetic Recording Technology Scaling for Data Center Hard Disk Drives through Filesystem-level Transparent Local Erasure Coding Facilitating Magnetic Recording Technology Scaling for Data Center Hard Disk Drives through Filesystem-level Transparent Local Erasure Coding Yin Li, Hao Wang, Xuebin Zhang, Ning Zheng, Shafa Dahandeh,

More information

When MPPDB Meets GPU:

When MPPDB Meets GPU: When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models RCFile: A Fast and Space-efficient Data

More information

Nutanix Tech Note. Virtualizing Microsoft Applications on Web-Scale Infrastructure

Nutanix Tech Note. Virtualizing Microsoft Applications on Web-Scale Infrastructure Nutanix Tech Note Virtualizing Microsoft Applications on Web-Scale Infrastructure The increase in virtualization of critical applications has brought significant attention to compute and storage infrastructure.

More information

called Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil

called Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil Parallel Genome-Wide Analysis With Central And Graphic Processing Units Muhamad Fitra Kacamarga mkacamarga@binus.edu James W. Baurley baurley@binus.edu Bens Pardamean bpardamean@binus.edu Abstract The

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

HTRC Data API Performance Study

HTRC Data API Performance Study HTRC Data API Performance Study Yiming Sun, Beth Plale, Jiaan Zeng Amazon Indiana University Bloomington {plale, jiaazeng}@cs.indiana.edu Abstract HathiTrust Research Center (HTRC) allows users to access

More information

The Performance Analysis of a Service Deployment System Based on the Centralized Storage

The Performance Analysis of a Service Deployment System Based on the Centralized Storage The Performance Analysis of a Service Deployment System Based on the Centralized Storage Zhu Xu Dong School of Computer Science and Information Engineering Zhejiang Gongshang University 310018 Hangzhou,

More information

New Approach to Unstructured Data

New Approach to Unstructured Data Innovations in All-Flash Storage Deliver a New Approach to Unstructured Data Table of Contents Developing a new approach to unstructured data...2 Designing a new storage architecture...2 Understanding

More information

The Future of High Performance Computing

The Future of High Performance Computing The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer

More information

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,

More information

CSE 124: Networked Services Lecture-17

CSE 124: Networked Services Lecture-17 Fall 2010 CSE 124: Networked Services Lecture-17 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa10/cse124 11/30/2010 CSE 124 Networked Services Fall 2010 1 Updates PlanetLab experiments

More information

Google File System. By Dinesh Amatya

Google File System. By Dinesh Amatya Google File System By Dinesh Amatya Google File System (GFS) Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung designed and implemented to meet rapidly growing demand of Google's data processing need a scalable

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Mining Distributed Frequent Itemset with Hadoop

Mining Distributed Frequent Itemset with Hadoop Mining Distributed Frequent Itemset with Hadoop Ms. Poonam Modgi, PG student, Parul Institute of Technology, GTU. Prof. Dinesh Vaghela, Parul Institute of Technology, GTU. Abstract: In the current scenario

More information

EMC XTREMCACHE ACCELERATES VIRTUALIZED ORACLE

EMC XTREMCACHE ACCELERATES VIRTUALIZED ORACLE White Paper EMC XTREMCACHE ACCELERATES VIRTUALIZED ORACLE EMC XtremSF, EMC XtremCache, EMC Symmetrix VMAX and Symmetrix VMAX 10K, XtremSF and XtremCache dramatically improve Oracle performance Symmetrix

More information

Strategic Briefing Paper Big Data

Strategic Briefing Paper Big Data Strategic Briefing Paper Big Data The promise of Big Data is improved competitiveness, reduced cost and minimized risk by taking better decisions. This requires affordable solution architectures which

More information

LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data

LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data Xingbo Wu Yuehai Xu Song Jiang Zili Shao The Hong Kong Polytechnic University The Challenge on Today s Key-Value Store Trends on workloads

More information

GFS: The Google File System

GFS: The Google File System GFS: The Google File System Brad Karp UCL Computer Science CS GZ03 / M030 24 th October 2014 Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one

More information

Advanced Database Systems

Advanced Database Systems Lecture II Storage Layer Kyumars Sheykh Esmaili Course s Syllabus Core Topics Storage Layer Query Processing and Optimization Transaction Management and Recovery Advanced Topics Cloud Computing and Web

More information

Isilon Performance. Name

Isilon Performance. Name 1 Isilon Performance Name 2 Agenda Architecture Overview Next Generation Hardware Performance Caching Performance Streaming Reads Performance Tuning OneFS Architecture Overview Copyright 2014 EMC Corporation.

More information

IBM Spectrum Scale IO performance

IBM Spectrum Scale IO performance IBM Spectrum Scale 5.0.0 IO performance Silverton Consulting, Inc. StorInt Briefing 2 Introduction High-performance computing (HPC) and scientific computing are in a constant state of transition. Artificial

More information

Exploring GPU Architecture for N2P Image Processing Algorithms

Exploring GPU Architecture for N2P Image Processing Algorithms Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly

More information

Open vstorage RedHat Ceph Architectural Comparison

Open vstorage RedHat Ceph Architectural Comparison Open vstorage RedHat Ceph Architectural Comparison Open vstorage is the World s fastest Distributed Block Store that spans across different Datacenter. It combines ultrahigh performance and low latency

More information

RAID6L: A Log-Assisted RAID6 Storage Architecture with Improved Write Performance

RAID6L: A Log-Assisted RAID6 Storage Architecture with Improved Write Performance RAID6L: A Log-Assisted RAID6 Storage Architecture with Improved Write Performance Chao Jin, Dan Feng, Hong Jiang, Lei Tian School of Computer, Huazhong University of Science and Technology Wuhan National

More information

GPU Programming Using NVIDIA CUDA

GPU Programming Using NVIDIA CUDA GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics

More information

CSE 124: Networked Services Lecture-16

CSE 124: Networked Services Lecture-16 Fall 2010 CSE 124: Networked Services Lecture-16 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa10/cse124 11/23/2010 CSE 124 Networked Services Fall 2010 1 Updates PlanetLab experiments

More information

Enhanced Web Log Based Recommendation by Personalized Retrieval

Enhanced Web Log Based Recommendation by Personalized Retrieval Enhanced Web Log Based Recommendation by Personalized Retrieval Xueping Peng FACULTY OF ENGINEERING AND INFORMATION TECHNOLOGY UNIVERSITY OF TECHNOLOGY, SYDNEY A thesis submitted for the degree of Doctor

More information

The Fusion Distributed File System

The Fusion Distributed File System Slide 1 / 44 The Fusion Distributed File System Dongfang Zhao February 2015 Slide 2 / 44 Outline Introduction FusionFS System Architecture Metadata Management Data Movement Implementation Details Unique

More information