Parallelizing Inline Data Reduction Operations for Primary Storage Systems

Jeonghyeon Ma and Chanik Park
Department of Computer Science and Engineering, POSTECH, Pohang, South Korea
{doitnow0415,cipark}@postech.ac.kr

Abstract. Data reduction operations such as deduplication and compression are widely used to save storage capacity in primary storage systems, but they are compute-intensive. Because high-performance storage devices such as SSDs are now common in primary storage systems, data reduction operations have become a performance bottleneck in SSD-based systems. In this paper, we propose a parallel data reduction technique for deduplication and compression that utilizes a multi-core CPU and a GPU in an integrated manner. First, we introduce bin-based data deduplication, a parallel deduplication technique in which CPU-based parallelism is the main mechanism and the GPU serves as a co-processor. Second, we propose a parallel compression technique in which the main computation is done by the GPU while the CPU is responsible only for post-processing. Third, we propose a parallel technique that handles deduplication and compression together and controls when and how the GPU is used. Experimental evaluation shows that the proposed techniques achieve 15.0%, 88.3%, and 89.7% better throughput than CPU-only deduplication, compression, and integrated data reduction, respectively. The proposed technique thus enables easy application of data reduction operations to SSD-based primary storage systems.

Keywords: Primary storage · Inline data reduction scheme · GPU

1 Introduction

Data reduction operations such as deduplication and compression are widely used to save storage capacity on primary storage systems.
In recent years, however, the shift of primary storage systems from HDDs to SSDs has exposed the computational overhead of data reduction operations, making them difficult to apply. One way to conceal this overhead is to store all data on the storage system first and then perform data reduction in the background when the system is idle. However, this generates more write I/O than a system without data reduction, so it is not applicable to SSD-based storage systems, which have limited write endurance. Applying data reduction on the critical I/O path instead increases the lifetime of SSD-based storage systems, but can significantly degrade I/O performance.

© Springer International Publishing AG 2017. V. Malyshkin (Ed.): PaCT 2017, LNCS 10421, pp. 301–307, 2017. DOI: 10.1007/978-3-319-62932-2_29

One way to improve the throughput of data reduction is to take advantage
of GPUs, which are designed for computation-intensive workloads. However, depending on the workload, CPU-based parallel data reduction can outperform GPU-based techniques. In this paper, we propose inline parallel data reduction operations based on a multi-core CPU and a GPU for primary storage systems. To this end, we design parallel deduplication and compression methods that take the multi-core CPU and GPU architectures into account, and we show how to integrate the CPU- and GPU-based data reduction operations.

2 Background

Data reduction operations such as deduplication and compression are widely used to save storage capacity on primary storage systems. This section describes the basic tasks of data reduction operations and their performance bottlenecks. Deduplication is performed in four stages: chunking, hashing, indexing, and destaging. Chunking breaks a data stream into chunks, the base unit for checking data redundancy. Hashing calculates the hash value of each chunk, which serves as the chunk's identifier. Indexing compares the hash value of each chunk with the hash values of already stored chunks to determine whether the chunk is a duplicate. If the chunk is found to be unique, a destaging step stores it on the storage device. Of these stages, hashing and indexing are the main performance bottlenecks in deduplication systems, and previous work [1] has attempted to address them. Among compression algorithms, LZ-based algorithms are widely used in primary storage systems due to their simplicity and effectiveness [2]. LZ compression uses a history buffer and a look-ahead buffer.
If a matching character sequence is found in both the history buffer and the look-ahead buffer, the sequence in the look-ahead buffer is replaced by a pointer to the matching sequence in the history buffer. This string matching is the performance bottleneck of LZ compression.

3 Design and Implementation

3.1 Parallel Data Deduplication on Multi-core CPU and GPU

In the hashing phase there is no data dependency between chunks, so the hash values of multiple chunks can easily be calculated in parallel. Parallelizing indexing is more complicated, because the hash table used to determine a chunk's redundancy is globally shared across all computing threads. This section therefore describes how indexing is parallelized on the multi-core CPU and the GPU, and how it applies to a primary storage system. (1) How to Parallelize Indexing on the CPU: we divide the hash table into several small hash tables called bins, so that multiple computing threads can check chunks against different bins at the same time without a locking mechanism. This is a technique
that is commonly used in existing DHT-based systems. We call this operation bin-based indexing. In addition, to avoid disk accesses that significantly degrade performance, hash table entries are kept only in memory, not on disk. Due to this index management policy, the deduplication module may miss some duplicate data, but the impact is small. Assuming a storage capacity of 4 TB, a chunk size of 8 KB, and an index entry size of 32 bytes, including the hash (SHA-1, 20 bytes) and other metadata, the storage system requires 16 GB of memory for the index; a typical primary storage target needs correspondingly less. Memory consumption can be reduced further by removing a prefix of each hash entry: if the prefix is n bytes, the deduplication system keeps only 20 − n bytes of each hash value. With a 2-byte prefix, this saves 1 GB of memory. (2) How to Parallelize Indexing on the GPU: parallel indexing on the GPU must take the architectural characteristics of the GPU into account. First, the GPU is connected to system memory via the PCI interface, and data used for a calculation must be transferred from system memory to GPU device memory. Second, GPU threads in the same workgroup execute the same instruction even though each thread has its own execution path, so many branch operations can degrade computational performance; the GPU code must therefore be kept simple. Third, GPUs have many computing cores and large memory bandwidth, so large amounts of data can be processed at a time; distributing data across all computing cores and choosing the data layout are critical to exploiting all GPU resources. The GPU also performs bin-based indexing, just as the CPU does.
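As a minimal illustration, CPU-side bin-based indexing can be sketched as follows. This is our own Python sketch, not the paper's implementation: the bin count, prefix length, and all function names are illustrative assumptions. Because the hash prefix selects the bin and each worker owns exactly one bin, chunks grouped by bin can be indexed without any locking, and only the 20 − n byte suffix of each hash is stored.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

NUM_BINS = 256      # assumed bin count; the paper does not fix one
PREFIX_BYTES = 1    # the prefix selects the bin, so it need not be stored

bins = [dict() for _ in range(NUM_BINS)]   # one small hash table per bin


def bin_id(digest):
    # Bin selection by hash prefix.
    return digest[0] % NUM_BINS


def index_bin_group(args):
    """Index all chunks that fall into one bin. Since each worker owns
    exactly one bin, no locking mechanism is needed."""
    bid, items = args                      # items: list of (position, digest)
    table, out = bins[bid], []
    for pos, d in items:
        suffix = d[PREFIX_BYTES:]          # keep only 20 - n bytes per entry
        out.append((pos, suffix in table)) # hit -> duplicate chunk
        table.setdefault(suffix, len(table))
    return out


def index_chunks(chunks):
    # Hashing stage: no inter-chunk dependency, trivially parallel.
    digests = [hashlib.sha1(c).digest() for c in chunks]
    # Indexing stage: group digests by bin, then index bins in parallel.
    groups = {}
    for pos, d in enumerate(digests):
        groups.setdefault(bin_id(d), []).append((pos, d))
    results = [None] * len(chunks)
    with ThreadPoolExecutor() as pool:
        for out in pool.map(index_bin_group, groups.items()):
            for pos, is_dup in out:
                results[pos] = is_dup
    return results                         # True where a duplicate was found
```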
However, considering the characteristics of the GPU architecture, we organize each bin as a linear table rather than a tree. This contiguous data layout is useful when utilizing the GPU's local memory, because copying data from GPU global memory to local memory happens naturally when threads access the data contiguously, and it avoids extra branch operations. The GPU checks data redundancy by comparing against a single hash table. Only the hash values reside in GPU memory; the other chunk metadata is maintained in system memory, and the transfer of hash data itself serves as the table update, so there is no separate hash table update overhead on the GPU. The result of each lookup is therefore a pair consisting of an entry index and hit/miss information, which the metadata structures in system memory then consume. (3) When to use the GPU for indexing: to decide how to apply the GPU to indexing, we compare CPU and GPU indexing performance. For a fair comparison, the number of hash table entries is kept the same on the CPU and the GPU. Preliminary experiments show that CPU indexing is 4.16 to 5.45 times faster than GPU indexing in terms of execution time. For GPU indexing, a fixed portion of the execution time is spent launching the GPU kernel, so even with a high-performance GPU there is a limit to how far indexing on the GPU can be optimized. We therefore use the GPU for indexing only when CPU utilization is saturated and indexing work remains.
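The GPU-side lookup described above can be emulated in plain Python as a sketch (the constant, function name, and flat-bytes layout are our illustrative assumptions; on a real GPU one thread would handle each query, and the contiguous scan would map to coalesced memory accesses):

```python
HASH_LEN = 20  # SHA-1 digest size


def gpu_index_kernel(bin_table, queries):
    """Emulates the GPU indexing kernel: each queried hash is compared
    against a bin stored as one contiguous byte array of fixed-size
    entries. The kernel performs no table updates; it only returns an
    (entry index, hit/miss) pair per query, which the metadata
    structures in system memory consume afterwards."""
    n_entries = len(bin_table) // HASH_LEN
    results = []
    for q in queries:                  # on the GPU: one thread per query
        idx, hit = -1, False
        for i in range(n_entries):     # linear scan over the contiguous bin
            if bin_table[i * HASH_LEN:(i + 1) * HASH_LEN] == q:
                idx, hit = i, True
                break
        results.append((idx, hit))
    return results
```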
3.2 Parallel Data Compression on Multi-core CPU and GPU

In this section, we focus on parallelizing the LZ compression schemes commonly used in primary storage systems. (1) How to parallelize compression on the CPU: as with hashing, there is no data dependency between chunks, so compression can run independently on each chunk. CPU-based compression algorithms are well studied; compression is therefore parallelized on the CPU by assigning each chunk to a computing thread that runs a previously studied compression algorithm. (2) How to parallelize compression on the GPU: Ozsoy et al. [3] introduced a parallel compression algorithm for the GPU that divides the data into several sub-blocks, computes a compression result for each sub-block, and merges the results on the CPU. This algorithm is a poor fit for primary storage systems because it assumes that the data to be compressed is large enough to fully utilize the GPU resources; it does not work well for small target data. In our system the chunk size is 4 KB, so only a small number of computing cores could be allocated to compress a single chunk. We therefore design a compression algorithm that computes the compression results of many chunks at a time. The GPU allocates multiple threads to each chunk; each thread runs its own LZ compression with its own history buffer and look-ahead buffer, and adjacent threads inspect overlapping regions of the size of the history buffer. For performance reasons the GPU's compression results are not refined on the GPU, so the CPU must refine them; we call this post-processing. (3) How to use the GPU for compression: we compare the compression performance of the CPU and the GPU to determine when to use the GPU.
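The per-chunk scheme from (2) can be illustrated with a simplified sequential simulation (our own sketch: the window size, token format, and function names are assumptions, and the CPU post-processing is reduced here to concatenating the per-region token lists). Each "thread" compresses one sub-region of a chunk, and its search window reaches back into the neighboring region by the history-buffer size, mirroring the overlap inspected by adjacent GPU threads.

```python
WINDOW = 64      # assumed history-buffer size (overlap between adjacent workers)
MIN_MATCH = 3    # shortest sequence worth encoding as a match


def lz_compress_region(data, start, end):
    """Compress data[start:end]; matches may reach back up to WINDOW bytes
    before `start`, i.e. into the overlapping region of the previous worker."""
    tokens, i = [], start
    while i < end:
        best_len, best_off = 0, 0
        for j in range(max(0, i - WINDOW), i):   # scan the history window
            l = 0
            while i + l < end and j + l < i and data[j + l] == data[i + l]:
                l += 1
            if l > best_len:
                best_len, best_off = l, i - j
        if best_len >= MIN_MATCH:
            tokens.append(("match", best_off, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))      # emit a literal byte
            i += 1
    return tokens


def compress_chunk(data, workers=4):
    # One "GPU thread" per sub-region; the CPU post-processing step is
    # modeled as simple concatenation of the per-region token lists.
    step = (len(data) + workers - 1) // workers
    parts = [lz_compress_region(data, s, min(s + step, len(data)))
             for s in range(0, len(data), step)]
    return [t for part in parts for t in part]


def decompress(tokens):
    # Reference decoder used to verify the token stream round-trips.
    out = bytearray()
    for t in tokens:
        if t[0] == "lit":
            out.append(t[1])
        else:
            _, off, length = t
            for _ in range(length):
                out.append(out[len(out) - off])
    return bytes(out)
```

Because each region's matches stop at its own boundary, the per-region token lists can be produced independently and simply concatenated, which is what makes the scheme parallelizable per chunk.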
Experimental results (Sect. 4) show that GPU compression is 88.3% faster than CPU compression in terms of execution time. Given this large performance gap, the GPU performs compression and the CPU is used for refinement.

3.3 Putting It All Together

This section describes how to integrate the two parallel data reduction operations, deduplication and compression. First, we determine the order in which the operations are applied: based on the results of [5], we adopt the deduplication-before-compression order for a higher data reduction ratio. Second, we add a bin buffer structure to the deduplication algorithm. The bin buffer temporarily stores the hash entries of each bin before they are moved to GPU memory and the bin tree. When a buffer becomes full, its hashes are flushed from the buffer to storage, which generates sequential writes appropriate for the SSD. Figure 1 shows the workflow that integrates deduplication and compression on the CPU and GPU. GPU indexing is performed if the GPU is available, and CPU indexing is performed if no duplicate hash is found. On the CPU indexing path, the bin buffer is checked first, because recently updated chunks may reside in the bin buffer and, due to temporal locality, chunks are more likely to find duplicates there. If there are no duplicates in the bin buffer, check the
bin tree, which stores most of the hash table entries. If no duplicate is found, the chunk is regarded as unique and becomes a compression target. After the data is compressed, the bin buffer is updated with the unique chunk's hash. If the bin buffer becomes full, it is flushed to storage, and the corresponding bins in GPU memory are updated; currently a random replacement policy is applied.

Fig. 1. An integrated workflow of deduplication and compression proposed for data reduction operations

4 Evaluation

This section evaluates the throughput of the parallel data reduction operations on the CPU and GPU. Our test machine is equipped with an Intel i7-3770K, a Radeon HD 7970, and 16 GB of main memory. Vdbench [4] is used to generate the dataset. The size of the data stream is about 2 GB, and the deduplication and compression ratios are set to 2.0, a common ratio for primary storage systems. We compare our schemes with the throughput of a Samsung SSD 830, referred to simply as the SSD in this section. (1) Parallel data deduplication: the GPU performs indexing for only a small portion of the chunks. The integrated CPU and GPU workflow for indexing is the same as in Fig. 1, except for the compression phase. Experimental results show that the GPU-supported data deduplication scheme improves throughput by 15% over the CPU-only scheme, and achieves three times the throughput of the SSD. (2) Parallel data compression: the proposed technique uses the GPU for compression and the CPU for post-processing. Due to the nature of the compression technique, throughput is high when the compression ratio is high.
The CPU-based compression method has lower performance (about 50 K IOPS) than the SSD throughput (about 80 K IOPS) when the compression ratio is low, whereas the GPU-based parallel compression method achieves 100 K IOPS even at a low compression ratio, always exceeding the SSD throughput. (3) Putting it all together - parallelizing data deduplication and compression together: in an environment where both a CPU and a GPU are available, there are several
options for integrating the two data reduction operations, deduplication and compression. The first option is to use the GPU for both operations; the second is to use the GPU for only one of them; the last is to use the GPU for neither, which may be useful when GPU performance is poor. Figure 2 shows the throughput of these options.

Fig. 2. Throughput comparison of integration methods

Allocating the GPU to compression is the best choice among the integration methods, because data compression, which gains the most from the GPU, then monopolizes it. However, because hardware specifications differ across platforms, we cannot guarantee that this integration is always best. Therefore, before assigning processors to each data reduction operation, the system compares the performance of these integration methods using dummy I/O to find the one with the best throughput, ensuring the best performance even when the target platform differs.

5 Related Works

There has been much previous research on improving the throughput of data reduction operations. Some studies exploit parallelism in data deduplication systems. Xia et al. [6] proposed a multicore-based parallel data deduplication approach, but they did not consider indexing, which is known to be the main bottleneck in data deduplication [1]. Kim et al. [7] proposed a GPU-based data deduplication approach for primary storage, but they did not consider utilizing the CPU, which performs better than the GPU for the indexing operation. Other studies exploit GPU parallelism for compression. Ozsoy et al. [3] introduced parallel compression algorithms on the GPU, but their compression targets must be quite large to utilize GPU resources fully.
This does not match a primary storage system, which compresses many small 4 KB chunks. There is also research on CPU-parallel compression algorithms. Klein and Wiseman [8] introduce a compression algorithm executed
using a tree-structured hierarchy. Navarro and Raffinot [9] introduce an algorithm that divides the data stream into several small subsets and allocates a thread to each subset. Even though they parallelize compression on the CPU, our GPU-based approach performs at least about 88.3% better. There is also research analyzing the effect of mixing the two data reduction operations, deduplication and compression. Constantinescu et al. [5] analyze the data reduction ratio when deduplication and compression are applied together, but they focus only on the data reduction ratio, not on throughput.

6 Conclusion

Throughput is becoming more important as data reduction operations are applied to save space on SSD-based primary storage systems. To address this, we proposed parallel data reduction operations using a multi-core CPU and a GPU, and showed how to integrate deduplication and compression on them. Our parallel deduplication achieves three times the SSD's throughput. For compression, the throughput of the GPU-supported parallel compression method is 88.3% better than the average throughput of parallel QuickLZ. Finally, the GPU-supported integration shows a performance improvement of 89.7% over CPU-only parallel data reduction (deduplication ratio 2.0, compression ratio 2.0). Our proposed technique thus enables easy application of data reduction operations to SSD-based primary storage systems.

References

1. Guo, F., Efstathopoulos, P.: Building a high-performance deduplication system. In: USENIX Annual Technical Conference (2011)
2. De Agostino, S.: Lempel-Ziv data compression on parallel and distributed systems. Algorithms 4, 183–199 (2011)
3. Ozsoy, A., Swany, M., Chauhan, A.: Pipelined parallel LZSS for streaming data compression on GPGPUs. In: Parallel and Distributed Systems, pp. 37–44 (2012)
4. Berryman, A., Calyam, P., Honigford, M., Lai, A.M.: VDBench: a benchmarking toolkit for thin-client based virtual desktop environments. In: Cloud Computing Technology and Science, pp. 480–487 (2010)
5. Constantinescu, C., Glider, J., Chambliss, D.: Mixing deduplication and compression on active data sets. In: Data Compression Conference, pp. 393–402 (2011)
6. Xia, W., Jiang, H., Feng, D., Tian, L., Fu, M., Wang, Z.: P-dedupe: exploiting parallelism in data deduplication system. In: Networking, Architecture and Storage, pp. 338–347 (2012)
7. Kim, C., Park, K.W., Park, K.H.: GHOST: GPGPU-offloaded high performance storage I/O deduplication for primary storage system. In: Proceedings of the International Workshop on Programming Models and Applications for Multicores and Manycores, pp. 17–26 (2012)
8. Klein, S.T., Wiseman, Y.: Parallel Lempel Ziv coding (extended abstract). In: Amir, A. (ed.) CPM 2001. LNCS, vol. 2089, pp. 18–30. Springer, Heidelberg (2001). doi:10.1007/3-540-48194-X_2
9. Navarro, G., Raffinot, M.: Practical and flexible pattern matching over Ziv-Lempel compressed text. J. Discrete Algorithms 2, 347–371 (2004)