Parallelizing Inline Data Reduction Operations for Primary Storage Systems

Jeonghyeon Ma and Chanik Park
Department of Computer Science and Engineering, POSTECH, Pohang, South Korea
{doitnow0415,cipark}@postech.ac.kr

Springer International Publishing AG 2017. V. Malyshkin (Ed.): PaCT 2017, LNCS 10421, pp. 301-307, 2017. DOI: 10.1007/978-3-319-62932-2_29

Abstract. Data reduction operations such as deduplication and compression are widely used to save storage capacity in primary storage systems. These operations are compute-intensive, and because high-performance storage devices such as SSDs are now common in primary storage systems, data reduction has become a performance bottleneck in SSD-based primary storage systems. In this paper, we propose parallel data reduction techniques for deduplication and compression that utilize both the multi-core CPU and the GPU in an integrated manner. First, we introduce bin-based data deduplication, a parallel deduplication technique in which CPU-based parallelism does most of the work while the GPU serves as a co-processor. Second, we propose a parallel compression technique in which the main computation is performed by the GPU while the CPU is responsible only for post-processing. Third, we propose a technique that handles both deduplication and compression in an integrated manner and controls when and how to use the GPU. Experimental evaluation shows that our proposed techniques achieve 15.0%, 88.3%, and 89.7% better throughput than CPU-only processing for deduplication, compression, and integrated data reduction, respectively. Our proposed techniques enable easy application of data reduction operations to SSD-based primary storage systems.

Keywords: Primary storage · Inline data reduction scheme · GPU

1 Introduction

Data reduction operations such as deduplication and compression are widely used to save storage capacity on primary storage systems. In recent years, however, the shift of primary storage systems from HDD-based to SSD-based devices has exposed the computational overhead of data reduction operations, making them difficult to apply.

One way to conceal the overhead of data reduction is to store all of the data on the storage system first and perform data reduction in the background when the system is idle. However, this generates more write I/O than a system without data reduction, so it is not suitable for SSD-based storage systems because of write endurance problems. A way to preserve the lifetime of SSD-based storage systems is to apply data reduction inline, on the critical I/O path, but doing so can significantly degrade I/O performance. One way to improve the throughput of data reduction is to take advantage of GPUs, which are designed for computation-intensive workloads. However, depending on the workload, CPU-based parallel data reduction may outperform GPU-based techniques.

In this paper, we propose inline parallel data reduction operations based on the multi-core CPU and the GPU for primary storage systems. To this end, we design parallel deduplication and compression methods that take the multi-core CPU and GPU architectures into account, and we show how to integrate the CPU- and GPU-based data reduction operations.

2 Background

Data reduction operations such as deduplication and compression are widely used to save storage capacity on primary storage systems. This section describes their basic tasks and performance bottlenecks.

Deduplication is performed in four stages: chunking, hashing, indexing, and destaging. Chunking breaks a data stream into chunks, the base unit for checking data redundancy. Hashing calculates the hash value of each chunk, which is used as the chunk's identifier. Indexing compares the hash value of each chunk with the hash values of already stored chunks to determine whether the chunk is a duplicate. If the chunk is found to be unique, a destaging step stores it on the storage device. Of these stages, hashing and indexing are the main performance bottlenecks in deduplication systems, and previous work [1] has attempted to address them.

Among compression algorithms, LZ-based algorithms are widely used in primary storage systems because of their simplicity and effectiveness [2]. LZ compression operates on a history buffer and a look-ahead buffer: if the same character sequence is found in both buffers, the sequence in the look-ahead buffer is replaced by a pointer to the sequence in the history buffer. This string matching is the main performance bottleneck.
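To make the matching step concrete, below is a minimal, naive LZ77-style matcher in C++; the window size, token format, and the 3-byte minimum match length are illustrative assumptions rather than details taken from the paper. The inner scan over the history window is exactly the string matching that dominates the cost.

```cpp
// Naive LZ77-style sketch: replace a repeated sequence in the look-ahead region
// with a (distance, length) pointer into the history window.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Token { uint16_t distance; uint16_t length; uint8_t literal; };  // distance == 0 => literal

std::vector<Token> lz_compress(const std::vector<uint8_t>& in, size_t history = 4096) {
    std::vector<Token> out;
    size_t pos = 0;
    while (pos < in.size()) {
        size_t best_len = 0, best_dist = 0;
        size_t start = pos > history ? pos - history : 0;
        // String matching over the whole history window: the bottleneck step.
        for (size_t cand = start; cand < pos; ++cand) {
            size_t len = 0;
            while (pos + len < in.size() && in[cand + len] == in[pos + len]) ++len;
            if (len > best_len) { best_len = len; best_dist = pos - cand; }
        }
        if (best_len >= 3) {                                   // emit a back-reference
            out.push_back({uint16_t(best_dist), uint16_t(best_len), 0});
            pos += best_len;
        } else {                                               // emit a literal byte
            out.push_back({0, 0, in[pos]});
            ++pos;
        }
    }
    return out;
}
```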

3 Design and Implementation

3.1 Parallel Data Deduplication on Multi-core CPU and GPU

There is no data dependency between chunks when their hash values are calculated in the hashing phase, so multiple chunks can be hashed at once in a naturally parallel manner. Parallelizing the indexing is more complicated, because the hash table used to determine a chunk's redundancy is globally shared across all computing threads. This section therefore describes how indexing is parallelized on the multi-core CPU and the GPU, and how it is applied to the primary storage system.

(1) How to Parallelize Indexing on the CPU: We divide the hash table into several small hash tables called bins, so that multiple computing threads can check chunks against multiple hash tables at the same time without a locking mechanism; this technique is commonly used in existing DHT-based systems. We call this operation bin-based indexing. In addition, to avoid disk accesses that would significantly degrade performance, hash table entries are kept only in memory, never on disk. Because of this index management policy the deduplication module may miss some duplicate data, but the impact is small. Assuming a storage capacity of 4 TB, a chunk size of 8 KB, and an index entry size of 32 bytes, including the hash (SHA-1, 20 bytes) and other metadata, the storage system requires 16 GB of memory for the index; for a primary storage target, this is not an excessive amount of memory. Memory consumption can be reduced further by dropping a prefix of each hash value: if the prefix is n bytes, the deduplication system keeps only 20-n bytes per hash value. With a 2-byte prefix, this saves 1 GB of memory.
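As a rough illustration of bin-based indexing, the following C++ sketch selects a bin from a 2-byte hash prefix, stores only the remaining 18 bytes in memory, and lets each worker thread own a disjoint range of bins so that no locks are needed; the bin count, thread partitioning, and data structures are assumptions for illustration, not the authors' implementation.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <set>
#include <thread>
#include <vector>

using Sha1   = std::array<uint8_t, 20>;
using Suffix = std::array<uint8_t, 18>;                 // 20-byte SHA-1 minus the 2-byte prefix

constexpr size_t kNumBins = 1u << 16;                   // one bin per 2-byte prefix value
static std::vector<std::set<Suffix>> bins(kNumBins);    // in-memory index only; no disk accesses

static size_t bin_of(const Sha1& h) { return (size_t(h[0]) << 8) | h[1]; }
static Suffix suffix_of(const Sha1& h) {
    Suffix s;
    std::copy(h.begin() + 2, h.end(), s.begin());       // the prefix is implied by the bin
    return s;
}

// Each worker owns a disjoint range of bins, so it can update them without locking.
static void index_worker(size_t first_bin, size_t last_bin,
                         const std::vector<Sha1>& hashes, std::vector<uint8_t>& duplicate) {
    for (size_t i = 0; i < hashes.size(); ++i) {
        size_t b = bin_of(hashes[i]);
        if (b < first_bin || b >= last_bin) continue;   // chunk belongs to another worker's bins
        bool inserted = bins[b].insert(suffix_of(hashes[i])).second;
        duplicate[i] = inserted ? 0 : 1;                // already present => duplicate chunk
    }
}

// Usage sketch: split the bin space evenly across hardware threads.
void parallel_index(const std::vector<Sha1>& hashes, std::vector<uint8_t>& duplicate) {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t)
        workers.emplace_back(index_worker, kNumBins * t / n, kNumBins * (t + 1) / n,
                             std::cref(hashes), std::ref(duplicate));
    for (auto& w : workers) w.join();
}
```
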
(2) How to Parallelize Indexing on the GPU: Parallel indexing on the GPU has to take the architectural characteristics of the GPU into account. First, the GPU is connected to system memory via the PCIe interface, so the data used for the computation must be transferred from system memory to GPU device memory. Second, GPU threads in the same workgroup execute the same instruction regardless of branching, even though each thread has its own execution path; many branch operations can therefore degrade computational performance, which means the GPU code must be kept simple. Third, GPUs have many computing cores and large memory bandwidth, so they can process large amounts of data at a time; distributing data across all computing cores and choosing a suitable data layout is critical to exploiting the GPU fully.

The GPU performs bin-based indexing just like the CPU. However, considering these architectural characteristics, we organize each bin as a linear table rather than a tree structure. This contiguous data layout is useful for exploiting the GPU's local memory, because copying data from GPU global memory to local memory is straightforward when threads access the data sequentially, and it avoids extra branch operations. Only the hash values reside in GPU memory; the other chunk metadata is maintained in system memory, where it can be updated directly, so the GPU incurs no additional hash table update overhead. The indexing result for each chunk is therefore a pair consisting of an index number and hit/miss information, which the metadata structures in system memory then consume.

(3) When to Use the GPU for Indexing: We decide how to apply the GPU for indexing by comparing CPU and GPU indexing performance. For a fair comparison, the number of hash table entries used for indexing is kept the same on the CPU and the GPU. Preliminary experiments show that CPU indexing is 4.16 to 5.45 times faster than GPU indexing in terms of execution time. GPU indexing has a fixed execution-time component caused by the unavoidable GPU kernel launch overhead, so even with a high-performance GPU there is a limit to how much indexing can be optimized on the GPU. Therefore, we use the GPU only when CPU utilization is saturated and indexing work remains.
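To make the linear bin layout and the (index, hit/miss) result format of item (2) concrete, here is a hedged CUDA kernel sketch; the entry packing, launch shape, and the assumption that one launch targets one bin are illustrative choices, not the authors' code, and staging the contiguous bin into shared (local) memory is omitted for brevity.

```cuda
#include <cstdint>

// Each thread checks one incoming chunk hash against a linear, array-shaped bin.
// Only 20-byte hash values live in GPU memory; the kernel writes an entry index
// (or -1 for a miss) per query, which CPU-side metadata code then consumes.
__global__ void gpu_bin_lookup(const uint8_t* bin_entries,    // packed 20-byte hashes of one bin
                               int            entries_in_bin,
                               const uint8_t* query_hashes,    // packed hashes of incoming chunks
                               int            num_queries,
                               int*           hit_entry)       // result: entry index or -1
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= num_queries) return;

    const uint8_t* h = query_hashes + q * 20;
    int found = -1;
    // A sequential scan over a contiguous array keeps the per-thread control flow
    // nearly identical (little divergence) and avoids the pointer chasing of a tree.
    for (int e = 0; e < entries_in_bin; ++e) {
        const uint8_t* cand = bin_entries + e * 20;
        bool equal = true;
        for (int b = 0; b < 20; ++b)
            equal = equal && (cand[b] == h[b]);
        if (equal) found = e;
    }
    hit_entry[q] = found;
}
```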

3.2 Parallel Data Compression on Multi-core CPU and GPU

This section focuses on parallelizing the LZ compression schemes commonly used in primary storage systems.

(1) How to Parallelize Compression on the CPU: As with hashing, there is no data dependency between chunks, so compression can run independently on each chunk. CPU-based compression algorithms have been studied extensively, so we parallelize compression on the CPU simply by assigning each chunk a computing thread that runs an existing compression algorithm.

(2) How to Parallelize Compression on the GPU: Ozsoy et al. [3] introduced a parallel compression algorithm for the GPU that divides the data into several sub-blocks, computes the compression result of each sub-block, and merges the results on the CPU. This algorithm is hard to apply in a primary storage system because it assumes that the data to be compressed is large enough to take full advantage of the GPU; it does not work well for small target data. In our system the chunk size is 4 KB, and only a small number of computing cores could be assigned to compress a single 4 KB chunk. We therefore design a compression algorithm that computes the compression results of many chunks at a time. The GPU allocates multiple threads to each chunk, each thread runs its own LZ compression with its own history buffer and look-ahead buffer, and adjacent threads inspect regions that overlap by the size of the history buffer. The per-thread compression results are not refined on the GPU for performance reasons; the CPU must refine them, a step we call post-processing.

(3) How to Use the GPU for Compression: We compare the compression performance of the CPU and the GPU to determine when to use the GPU. Experimental results show that GPU performance is 88.3% better than CPU performance in terms of execution time (see Sect. 4). Because the performance gap is large, the GPU performs the compression and the CPU is used for refinement.
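The following host-side sketch shows one way the per-chunk thread assignment could be laid out: each of several GPU threads receives its own slice of a 4 KB chunk as its look-ahead region plus an overlap of one history-buffer length with its left neighbour, and the CPU post-processing step later stitches the per-thread token streams together. The thread count and history size are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// One compression region per GPU thread: matches may reference bytes from
// history_begin onward, but tokens are emitted only for [data_begin, data_end).
struct Region { size_t history_begin; size_t data_begin; size_t data_end; };

std::vector<Region> split_chunk(size_t chunk_size = 4096,
                                size_t threads_per_chunk = 8,
                                size_t history_size = 256) {
    std::vector<Region> regions;
    size_t piece = chunk_size / threads_per_chunk;
    for (size_t t = 0; t < threads_per_chunk; ++t) {
        size_t begin = t * piece;
        size_t hist  = begin > history_size ? begin - history_size : 0;  // overlap with neighbour
        regions.push_back({hist, begin, begin + piece});
    }
    return regions;
}
// CPU post-processing would concatenate the per-thread outputs of each chunk and
// resolve any tokens that straddle region boundaries.
```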

3.3 Putting It All Together

This section describes how to incorporate the two parallel data reduction operations, deduplication and compression. First, we must determine the order in which they are applied; based on the results of [5], we adopt the deduplication-before-compression order for a higher data reduction ratio. Second, we add a bin buffer structure to the deduplication algorithm. The bin buffer temporarily stores the hashes of each bin before they are moved to GPU memory and the bin tree. When the buffer is full, the buffered hashes are flushed to storage, which produces suitably sequential writes for the SSD.

Figure 1 shows the workflow that integrates the deduplication and compression operations on the CPU and GPU. GPU indexing is performed when the GPU is available, and CPU indexing is performed if no duplicate hashes are found there. On the CPU indexing path, the bin buffer is checked first, because recently updated chunks reside in the bin buffer and, thanks to temporal locality, duplicates are more likely to be found there. If there is no duplicate in the bin buffer, the bin tree, which stores most of the hash table entries, is checked. If no duplicate is found at all, the chunk is regarded as unique and becomes a compression target. After the data is compressed, the bin buffer is updated because the chunk is unique. If the bin buffer becomes full, it is flushed to storage and the GPU bins in GPU memory are updated accordingly; currently, a random replacement policy is applied.

Fig. 1. An integrated workflow of deduplication and compression proposed for data reduction operations
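As a rough illustration, here is a condensed C++ sketch of the CPU-side path just described (Fig. 1); the bin buffer capacity, the set-based bin tree, and the helper names are illustrative assumptions rather than the authors' data structures.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <deque>
#include <set>

using Sha1 = std::array<uint8_t, 20>;

struct Bin {
    std::deque<Sha1> buffer;                       // recently inserted entries (temporal locality)
    std::set<Sha1>   tree;                         // bulk of the hash table entries
    static constexpr size_t kBufferCap = 1024;     // illustrative capacity

    bool is_duplicate(const Sha1& h) const {
        if (std::find(buffer.begin(), buffer.end(), h) != buffer.end()) return true;  // buffer first
        return tree.count(h) != 0;                                                    // then bin tree
    }

    void flush() {
        // Write the buffered entries to storage as one sequential I/O, fold them
        // into the bin tree, and refresh the GPU-resident copy of this bin.
        for (const Sha1& h : buffer) tree.insert(h);
        buffer.clear();
        // gpu_update_bin(*this);   // hypothetical helper: GPU bin refresh elided
    }

    void add_unique(const Sha1& h) {
        buffer.push_back(h);
        if (buffer.size() >= kBufferCap) flush();
    }
};

// Per-chunk flow: a duplicate only adds a reference; a unique chunk is compressed,
// stored, and indexed through the bin buffer.
void handle_chunk(Bin& bin, const Sha1& h /*, const ChunkData& data */) {
    if (bin.is_duplicate(h)) return;               // deduplicated: no new data written
    // compress_and_store(data);                   // hypothetical compression/destaging step
    bin.add_unique(h);
}
```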

4 Evaluation

This section evaluates the throughput of the parallel data reduction operations on the CPU and GPU. Our test machine is equipped with an Intel i7-3770K CPU, a Radeon HD 7970 GPU, and 16 GB of main memory. Vdbench [4] is used to generate the dataset; the data stream is about 2 GB in size. The deduplication ratio and compression ratio are both set to 2.0, which is common for primary storage systems. We compare our schemes against the throughput of a Samsung SSD 830, referred to simply as the SSD in this section.

(1) Parallel data deduplication: The GPU performs indexing for only a small portion of the chunks. The workflow integrating the CPU and GPU for indexing is the same as in Fig. 1, except that the compression phase is omitted. Experimental results show that the GPU-supported deduplication scheme improves throughput by 15% over the CPU-only deduplication scheme, and it achieves three times the throughput of the SSD.

(2) Parallel data compression: The proposed technique uses the GPU for compression and the CPU for post-processing. Due to the nature of the compression technique, throughput is high when the compression ratio is high. At a low compression ratio the CPU-based compression method delivers lower performance (about 50 K IOPS) than the SSD throughput (about 80 K IOPS), whereas the GPU-based parallel compression method still delivers about 100 K IOPS and therefore always exceeds the SSD throughput.

(3) Putting it all together, parallelizing deduplication and compression jointly: In an environment where both the CPU and the GPU are available, there are several options for integrating the two data reduction operations. The first option is to use the GPU for both operations, the second is to use the GPU for only one of them, and the last is to use the GPU for neither, which may be useful when GPU performance is poor. Figure 2 shows the throughput of these options.

Fig. 2. Throughput comparison of integration methods

Allocating the GPU to compression is the best choice among the integration methods, because data compression, which gains the most from the GPU, then monopolizes it. However, because hardware specifications differ across platforms, we cannot guarantee that this assignment is always the right one. Therefore, before assigning processors to the data reduction operations, the throughput of the integration methods is compared using dummy I/O and the best-performing configuration is chosen, so the best performance can be achieved even when the target platform differs.
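One plausible reading of this calibration step is sketched below: run a short dummy-I/O workload under each integration option and keep the configuration with the highest measured throughput. The option names and the measurement callback are hypothetical, for illustration only.

```cpp
#include <functional>
#include <vector>

// Hypothetical integration options: which processor runs which data reduction operation.
enum class Option { GpuForBoth, GpuForDedup, GpuForCompression, CpuOnly };

// `measure` drives a short dummy-I/O workload through the path configured for the
// given option and returns its throughput (e.g., IOPS); its implementation is
// platform-specific and elided here.
Option calibrate(const std::function<double(Option)>& measure) {
    const std::vector<Option> options = {Option::GpuForBoth, Option::GpuForDedup,
                                         Option::GpuForCompression, Option::CpuOnly};
    Option best = Option::CpuOnly;
    double best_tp = -1.0;
    for (Option opt : options) {
        double tp = measure(opt);                  // dummy I/O only, no user data involved
        if (tp > best_tp) { best_tp = tp; best = opt; }
    }
    return best;                                   // processor assignment used from then on
}
```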

5 Related Works

There has been a great deal of previous research on improving the throughput of data reduction operations.

Some studies exploit parallelism in data deduplication systems. Xia et al. [6] proposed a multicore-based parallel data deduplication approach; however, they did not consider the indexing operation, which is known to be the main bottleneck in data deduplication [1]. Kim et al. [7] proposed a GPU-based data deduplication approach for primary storage; however, they did not consider utilizing the CPU, which performs better than the GPU for the indexing operation.

Some studies exploit GPU parallelism for compression. Ozsoy et al. [3] introduced parallel compression algorithms on the GPU; however, their compression targets must be quite large to utilize GPU resources fully, which does not match primary storage systems that compress several 4 KB chunks. There is also research on CPU-parallel compression algorithms. Klein and Wiseman [8] introduce an algorithm that performs compression over a tree-structured hierarchy, and Navarro and Raffinot [9] introduce an algorithm that divides the data stream into several small subsets and assigns a thread to each subset. Even though these approaches parallelize compression on the CPU, our GPU-based approach is at least about 88.3% faster.

Finally, there is research analyzing the effect of mixing the two data reduction operations. Constantinescu et al. [5] analyze the data reduction ratio when deduplication and compression are applied together, but they focus only on the data reduction ratio, not on throughput.

6 Conclusion

Throughput is becoming more important as data reduction operations are applied to save space on SSD-based primary storage systems. To address this problem, we proposed parallel data reduction operations using the multi-core CPU and the GPU, and we showed how to integrate the deduplication and compression techniques on them. Applying our parallel approach to deduplication achieves three times the throughput of the SSD. For compression, the throughput of the GPU-supported parallel compression method is 88.3% better than the average throughput of parallel QuickLZ. Finally, the GPU-supported integration shows a performance improvement of 89.7% over CPU-only parallel data reduction (deduplication ratio 2.0, compression ratio 2.0). These results mean that our proposed techniques enable easy application of data reduction operations to SSD-based primary storage systems.

References

1. Guo, F., Efstathopoulos, P.: Building a high-performance deduplication system. In: USENIX Annual Technical Conference (2011)
2. De Agostino, S.: Lempel-Ziv data compression on parallel and distributed systems. Algorithms 4, 183-199 (2011)
3. Ozsoy, A., Swany, M., Chauhan, A.: Pipelined parallel LZSS for streaming data compression on GPGPUs. In: Parallel and Distributed Systems, pp. 37-44 (2012)
4. Berryman, A., Calyam, P., Honigford, M., Lai, A.M.: VDBench: a benchmarking toolkit for thin-client based virtual desktop environments. In: Cloud Computing Technology and Science, pp. 480-487 (2010)
5. Constantinescu, C., Glider, J., Chambliss, D.: Mixing deduplication and compression on active data sets. In: Data Compression Conference, pp. 393-402 (2011)
6. Xia, W., Jiang, H., Feng, D., Tian, L., Fu, M., Wang, Z.: P-Dedupe: exploiting parallelism in data deduplication system. In: Networking, Architecture and Storage, pp. 338-347 (2012)
7. Kim, C., Park, K.W., Park, K.H.: GHOST: GPGPU-offloaded high performance storage I/O deduplication for primary storage system. In: Proceedings of the International Workshop on Programming Models and Applications for Multicores and Manycores, pp. 17-26 (2012)
8. Klein, S.T., Wiseman, Y.: Parallel Lempel Ziv coding (extended abstract). In: Amir, A. (ed.) CPM 2001. LNCS, vol. 2089, pp. 18-30. Springer, Heidelberg (2001). doi:10.1007/3-540-48194-X_2
9. Navarro, G., Raffinot, M.: Practical and flexible pattern matching over Ziv-Lempel compressed text. J. Discrete Algorithms 2, 347-371 (2004)