DDSF: A Data Deduplication System Framework for Cloud Environments

Jianhua Gu, Chuang Zhang and Wenwei Zhang
School of Computer Science and Technology, High Performance Computing R&D Center, Northwestern Polytechnical University, Xi'an, China
gujh@nwpu.edu.cn, pers@mail.nwpu.edu.cn

Keywords: Cloud Storage, Data Deduplication, Hash Table Partition, Index File System.

Abstract: Cloud storage has been widely used because it provides seemingly unlimited storage space and flexible access, but the rising cost of storage and communication is an issue. In this paper, we propose a Data Deduplication System Framework (DDSF) for cloud storage environments. The DDSF consists of three major components: the client, the fingerprint server and the storage component. The client component divides a file into chunks and calculates the hash value of each chunk, and it sends to the storage component only those chunks whose hash values do not yet exist in the fingerprint server component, which reduces the consumption of communication bandwidth and storage space. We developed an Index File System (IFS) to manage the metadata of user files in the fingerprint server. The fingerprint server component maintains a hash table containing the hash values of the chunks already stored in the storage component. This paper presents a two-level indexing mechanism to reduce both the spatial and temporal overhead of accessing the hash table: the first level employs a Bloom Filter and the second level uses a hash table partition mechanism. In order to reduce the possibility of accidental collisions of data block hash values, we use two hash algorithms to calculate the hash value of each chunk. We further conduct several experiments on our framework. The results demonstrate that the proposed framework can improve storage space utilization and reduce communication consumption.

1 INTRODUCTION

Cloud storage has become an effective way for people to store massive data, due to its high reliability and high scalability (Bonvin, 2010). However, as the quantity of data increases, the costs of storage and communication increase rapidly, so it is important to seek an effective storage method that improves the utilization of storage space and accelerates storing. According to (Soghoian, 2011), an average of 60% of the data in Dropbox may be deduplicated. Duplicated data introduces extra storage space and communication overhead. Data deduplication is a technique for eliminating duplicate copies of repeating data; it is used to improve storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent. Data deduplication can detect and remove duplicated data at two different levels, the file level or the block level (Maddodi, 2010). Cross-user deduplication is used in practice to maximize the benefits of data deduplication: it identifies redundant data among different users, removes the redundancy and saves just a single copy of the duplicate data. Data deduplication in a cloud storage system can therefore achieve higher storage efficiency, saving costs for cloud storage providers, and can also reduce storage and communication costs for the consumers who use the cloud storage system.

This paper presents a Data Deduplication System Framework for cloud storage (DDSF), which adds a deduplication mechanism to a common cloud storage system to reduce data redundancy without changing the common cloud storage framework.
It can solve the problem of low utilization of storage space in common cloud storage. The DDSF is composed of three components: the client component, the fingerprint server component and the storage component. The fingerprint server component is responsible for storing and retrieving the hash value of each data block, called its fingerprint. Before storing a file, the client component divides the file into blocks and calculates the hash value of each block.

By communicating with the fingerprint server component, the client component can learn which blocks already exist in the storage component, and it then transmits only the blocks that are not yet stored there. This effectively reduces the bandwidth consumption of communication. In DDSF, we use two hash algorithms to calculate the hash value of each data block, which greatly reduces the possibility of accidental collisions (Henson, 2003) (Black, 2006). The size of the hash table grows as the amount of data stored in the cloud grows, so retrieving entries from the hash table becomes inefficient. We design a two-level indexing mechanism for the hash table to speed up its retrieval. We also develop a file system, named IFS (Index File System). The IFS is the key component for metadata management and access. The metadata is the information about a user's file, namely the position of each block's hash value in the hash table. The IFS uses a tree structure to organize directories. The content of a file in IFS is no longer the real data of the user file, but the addresses in the hash table of all chunks belonging to that file. We use a system image, which stores a copy of the IFS, and a log file of operations on the IFS to enhance the persistence and robustness of the IFS. In our implementation of DDSF, we use HDFS (Hadoop Distributed File System) as the storage system (Borthakur, 2007). HDFS is a distributed file system that runs on commodity hardware and was developed by the Apache Software Foundation for managing massive data.

The rest of this paper is organized as follows. Related works are discussed in Section 2. The proposed system framework and its main components are discussed in detail in Section 3. Experiments are presented in Section 4. Section 5 concludes the paper.

2 RELATED WORKS

As the amount of data stored in the cloud storage system increases, the size of the hash table also increases, so efficiently searching the hash table has become a key issue of data deduplication. In (Zhu, 2008), the authors described the Data Domain deduplication system, which proposed using multiple indexes and searching the entire index table step by step to improve efficiency, but at a much higher storage space overhead. In this paper, we develop a two-level index mechanism to reduce both the spatial and temporal overhead. The first-level index is based on the Bloom Filter (Broder, 2003) (Bloom, 1970), and the second-level index uses a partition mechanism which divides the hash table into pieces of suitable size. The Bloom Filter uses memory space efficiently. It maintains a binary vector data structure in memory and can quickly test whether a certain element is a member of a set. False positive matches are possible, but false negatives are not; i.e. a query returns either "definitely not in set" or "possibly in set". In the first case, we can immediately conclude that the element does not exist, and in the second case, we do a further search in the second level.
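
As a concrete illustration of this first-level test, the following minimal Python sketch shows a Bloom Filter with the "no false negatives, possible false positives" behaviour described above; the bit-vector size, the choice of two hash functions and all names are illustrative assumptions, not the structure used in DDSF.

    import hashlib

    class BloomFilter:
        """Minimal Bloom Filter: no false negatives, occasional false positives."""

        def __init__(self, num_bits=2**16):
            self.num_bits = num_bits
            self.bits = bytearray(num_bits // 8)   # bit vector kept in memory

        def _positions(self, item):
            # Derive two bit positions from independent digests of the item.
            for algo in (hashlib.md5, hashlib.sha1):
                digest = algo(item).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item):
            # False means "definitely not in the set"; True means "possibly in the set".
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    bf = BloomFilter()
    bf.add(b"fingerprint-of-chunk-A")
    assert bf.might_contain(b"fingerprint-of-chunk-A")    # never a false negative
    print(bf.might_contain(b"fingerprint-of-chunk-B"))    # usually False; could be a false positive
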
Several deduplication storage systems have been designed previously, including Venti (Sean, 2002), Extreme Binning (Bhagwat, 2009), DeDe (Austin, 2009) and DeDu (Sun, 2011). Venti is a network-based data archiving system that uses unique hash values to identify data blocks and an append-only way to store data. It lacks support for deleting data, so it is not suitable for the cloud storage environment. Extreme Binning stores each entire file on a single backup node. Each backup node is autonomous and manages its own index table and data, without sharing information with or knowing about other backup nodes. It deduplicates only within one backup node, so it is not suitable for global deduplication of massive data. DeDe is a block-level deduplication cluster file system without central coordination; however, it is only suitable for deduplicating virtual machine images in a virtual machine environment, not for a general data storage system. DeDu is a file-level deduplication system in which whole-file similarity detection is done in the client, but since the repetition rate at the file level is generally not very high, its deduplication rate is relatively low.

In summary, this paper proposes a block-level, online deduplication framework. File chunking and data block fingerprint calculation are placed in the client, which reduces the computational pressure on the server. Moreover, the framework identifies duplicate data before data transmission, so it reduces the amount of data transmitted and the bandwidth consumed.

3 FRAMEWORK

The proposed framework, named DDSF, is shown in Figure 1. The DDSF is composed of three components: the client component, the fingerprint server component and the storage component.

Client Component: In this component, a file is divided into blocks (chunks) and the hash value of every block is calculated; these hash values are then sent to the fingerprint server component, which tests whether each data block already exists in the storage component and returns the results to the client component. The client component then decides, based on these results, whether to transmit each data block to the storage component. When a user needs to download a file, the chunk merging module downloads all the chunks of the file directly from the storage component and merges them into a file.

Fingerprint Server Component: This component contains a file system, named IFS, with two main functions: metadata management and hash table management. The metadata management mainly manages the metadata of users' files and the operations on those files. The hash table management maintains the hash table of all blocks and provides efficient access to it. Each hash table entry contains the block's hash value, the actual address of the data block in the storage system, the repetition count of the block and so on.

Storage Component: We use HDFS as the storage system; the data blocks of user files are actually stored here.

DDSF realizes online deduplication. By interacting with the fingerprint server component, the client component can know which data blocks are already stored in the storage component before sending the data, so it saves storage space and reduces bandwidth consumption by sending only the non-duplicated blocks to HDFS. The client component and the fingerprint server component are the main functional components of the framework.

Figure 1: Framework Architecture.

3.1 Fingerprint Server Component

3.1.1 Index File System

The main function of this component is achieved through the IFS, a file system. The IFS is the key component for managing and accessing the metadata of user files. The metadata includes the file's attributes and the addresses of the file's data blocks, the latter being pointers to entries of the hash table. For each user file, the IFS creates a corresponding file in IFS, the shadow file or metadata file. Instead of the real content of a user file, the metadata of the user file is stored in the shadow file.

When the fingerprint server component receives the command from the client component to store a new file in the storage component, it creates a new metadata file in IFS according to the file name sent by the client component. Then the fingerprint server component tests whether the hash value of each chunk from the client component already exists in the hash table. If so, it increases the counter of the hash table entry by 1 and stores the address of this entry in the metadata file. If not, it inserts a new entry containing the hash value and the address of this block in the storage component into the hash table, and then stores the address of the new entry in the metadata file. When the fingerprint server component receives the command from the client component to download a file from the storage component, it gets the addresses of the hash table entries from the metadata file. Using these addresses, the fingerprint server component gets the addresses of the chunks stored in the storage component from the hash table and sends them to the client component, which then downloads the chunks directly from the storage component according to these addresses.

The IFS realizes the metadata management and hash table management functions. The IFS uses a tree structure to manage the directory of metadata files. It creates a child node for each user on the tree, and each user's directory structure is stored in a sub-tree rooted at that node. Each user can only access the files in his/her own directory. The hash table management is the key part of the IFS.
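
The store and download paths described above can be pictured with the following sketch; the in-memory dictionaries standing in for the IFS metadata files and the hash table, and all function names, are simplifying assumptions, not the DDSF implementation.

    # Hypothetical in-memory stand-ins for the IFS metadata files and the hash table;
    # the real DDSF persists these on the fingerprint server, so names and structures
    # here are illustrative only.
    hash_table = {}   # fingerprint -> {"address": block location in HDFS, "refcount": int}
    metadata = {}     # user file name -> ordered list of fingerprints (hash table entries)

    def record_chunk(file_name, fingerprint, address_if_new=None):
        """Store path, called once per chunk hash sent by the client: reuse the
        existing entry (and bump its reference count) or insert a new one.
        Returns True when the client still has to upload the chunk to HDFS."""
        entries = metadata.setdefault(file_name, [])
        entries.append(fingerprint)
        entry = hash_table.get(fingerprint)
        if entry is not None:
            entry["refcount"] += 1          # duplicate chunk: count it, no upload needed
            return False
        hash_table[fingerprint] = {"address": address_if_new, "refcount": 1}
        return True                         # new chunk: client uploads it, address recorded

    def resolve_file(file_name):
        """Download path: map each metadata entry back to the chunk address in HDFS."""
        return [hash_table[fp]["address"] for fp in metadata[file_name]]

    if record_chunk("report.docx", "md5-of-chunk-0", address_if_new="hdfs://block/123"):
        pass  # the client would now transmit the chunk to the storage component
    print(resolve_file("report.docx"))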

3.1.2 Two-level Index Mechanism

The IFS maintains a hash table. Each record of the hash table stores one data block's address in the storage component, the hash value of the block and the repetition count of the block. As the number of chunks stored in the system increases, both storing and retrieving the hash table become complex issues. In order to speed up access to the hash table and to decrease the amount of storage space, we use a Bloom Filter and a hash table partition technique. The two-level index mechanism is shown in Figure 2. The first-level index is handled by the Bloom Filter, and the second-level index is handled by searching a slice of the hash table. The pseudocode for searching a fingerprint with the two-level index mechanism is shown in Algorithm 1.

Algorithm 1: Searching a fingerprint.
Bool Search_fingerprint(fingerprint)
Begin
    16_bit_fp = To_16_bit(fingerprint);
    if (hit_in_bloom_filter(16_bit_fp))
        if (found_in_slice(16_bit_fp))
            return TRUE;
        else
            return FALSE;
    else
        return FALSE;
End

Figure 2: The two-level index mechanism.

The Bloom Filter table can reside in memory, so it has high query efficiency. A bit in the Bloom Filter represents whether a hash value exists in the hash table. We use the hash value calculated by the MD5 algorithm as the main index value, and this hash value is 128 bits long. Obviously, if we used all values generated by MD5 to create the Bloom Filter table, its size would be 2^128 bits, which is far too big to load into memory. So we design a compressed hash value indexing technique: it divides the 128-bit hash value into eight 16-bit parts and then applies an Exclusive OR operation to these 8 parts to obtain a new 16-bit hash value, which is used to index the Bloom Filter table. The Bloom Filter table size is then 2^16 (64K) bits, so it can reside in memory.

Once the fingerprint server component receives a hash value sent by the client component, it tests the status of the corresponding bit in the Bloom Filter table. If the bit is not set, the chunk is not a duplicate; the client component then stores this data block into HDFS directly, the fingerprint server component adds the location of the data block in HDFS and its hash value to the hash table as a new entry, and finally it sets the bit in the Bloom Filter according to the hash value. If the bit is set, the second-level index operation is performed.

In the second-level index, the hash table is divided into multiple parts and each part is stored as a file in a different directory of the IFS. The directories are created according to the 16-bit value produced by the Exclusive OR operation, and each possible value corresponds to one directory; all hash values stored in the same directory share the same 16-bit value. During indexing, the system locates the directory that matches the 16-bit value generated from a hash value and loads the slice in that directory into memory, which bounds the size of the hash table held in memory while preserving indexing performance. A hash value is judged to be new if it does not exist in the slice; in that case the data block is stored into HDFS and the storage address and hash value are added to this slice. The hash table in each directory can be divided into several fixed-size slices according to the number of records, and each slice is a hash sub-table. To prevent a single hash table from becoming too large, we set a maximum record number, such as 16M; when a hash table exceeds this maximum, the system creates a new hash table. This ensures the appropriate size of the hash table.
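
Under the assumptions that the fingerprint is an MD5 hex string and that each slice can be represented by an in-memory set, the 16-bit folding and the two-level lookup of Algorithm 1 could be sketched as follows; the real system keeps the slices as files in per-value directories, so this is only an illustration.

    import hashlib

    def to_16_bit(md5_hex):
        """Fold the 128-bit MD5 value into 16 bits by XOR-ing its eight 16-bit parts."""
        value = int(md5_hex, 16)
        folded = 0
        for _ in range(8):
            folded ^= value & 0xFFFF
            value >>= 16
        return folded

    bloom_bits = bytearray(2**16 // 8)   # first level: one bit per possible 16-bit value
    slices = {}                          # second level: 16-bit value -> fingerprints in that slice

    def search_fingerprint(md5_hex):
        """Two-level lookup: Bloom Filter first, then only the matching slice."""
        fp16 = to_16_bit(md5_hex)
        if not bloom_bits[fp16 // 8] & (1 << (fp16 % 8)):
            return False                               # definitely not stored yet
        return md5_hex in slices.get(fp16, set())      # search the slice for the full fingerprint

    def insert_fingerprint(md5_hex):
        fp16 = to_16_bit(md5_hex)
        bloom_bits[fp16 // 8] |= 1 << (fp16 % 8)
        slices.setdefault(fp16, set()).add(md5_hex)

    fp = hashlib.md5(b"some chunk data").hexdigest()
    print(search_fingerprint(fp))    # False: first upload of this chunk
    insert_fingerprint(fp)
    print(search_fingerprint(fp))    # True: the chunk is now a duplicate
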
3.1.3 System Image and Operation Log

We develop a system image and a log file of operations on the IFS to enhance its robustness and allow it to recover from failures. The system image persistently stores the structure of the file directory held in memory, and the operation log file records users' operation information on the IFS.

When the system starts, it reads the system image and the operation log from disk, combines the two files to generate a new system image, and at the same time rebuilds the directory structure of the IFS in memory. Finally it deletes the old operation log and creates a new blank log file. The system periodically stores the in-memory directory structure into the system image and updates the operation log file.

To handle concurrent access to the operation log, we set up two buffers and allow only one user to access a buffer at any moment. When a user accesses the first buffer to record operation information, the buffer is locked; the system then moves the operation information from the first buffer to the second buffer and unlocks the first buffer. After the first buffer is unlocked, another user can access it, while the second buffer writes the information to the log file. Because the I/O operation is performed on the second buffer, the next user can write to the operation log as soon as the previous user leaves the first buffer, which greatly improves the response speed when multiple users write to the log file.
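
A minimal sketch of this double-buffered log, assuming Python threads and a plain text log file, might look as follows; the locking policy shown is one possible reading of the description above, not the DDSF code.

    import threading

    class DoubleBufferedLog:
        """Operation log with two buffers: users record operations in the first
        buffer under a briefly held lock, while disk I/O drains the second buffer
        so the next user is not blocked by the write to the log file."""

        def __init__(self, path):
            self.path = path
            self.front = []                        # first buffer: users append here
            self.back = []                         # second buffer: owned by the flusher
            self.front_lock = threading.Lock()     # one user in the first buffer at a time
            self.io_lock = threading.Lock()        # one flusher performs file I/O at a time

        def append(self, record):
            with self.front_lock:                  # lock held only for an in-memory append
                self.front.append(record)
            self._flush()

        def _flush(self):
            with self.io_lock:
                with self.front_lock:              # brief swap: hand records to buffer two
                    self.front, self.back = [], self.front
                with open(self.path, "a") as log:  # slow disk write, first buffer already free
                    log.writelines(r + "\n" for r in self.back)
                self.back.clear()

    log = DoubleBufferedLog("ifs_operations.log")
    log.append("user1 CREATE /user1/report.docx")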

3.2 Client Component

The client component is the tool through which users access the cloud storage system. It consists of a file chunking module and a chunk merging module.

File Chunking Module: In this module, a file is broken into chunks of fixed or variable size, and the hash values of each chunk are calculated using both the MD5 and SHA-1 algorithms. The client then sends the store-file command and the list of chunk hash values to the fingerprint server component. The fingerprint server component creates a new file in the IFS to store the metadata of the user file and searches the hash table to determine, from the list of hash values, which chunks already exist. The fingerprint server component informs the client component to send the chunks that are not duplicated directly to HDFS.

Chunk Merging Module: When a user wants to download a file from the cloud, this module consults the fingerprint server component to obtain the addresses of all chunks of the file, then accesses HDFS directly to fetch the chunks block by block according to these addresses, and finally merges the chunks into a file.

In all, the client component can know whether a chunk is duplicated before transmitting it and sends only the non-duplicated chunks, which reduces traffic and bandwidth consumption.
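
A fixed-size chunker that produces the MD5/SHA-1 fingerprint pair for each chunk might look like the following sketch; the 1MB chunk size, the file name and the returned tuple format are illustrative assumptions, not the DDSF wire format.

    import hashlib

    def chunk_and_fingerprint(path, chunk_size=1024 * 1024):
        """Split a file into fixed-size chunks and fingerprint each chunk with both
        MD5 and SHA-1, so an accidental collision in one hash is caught by the other."""
        with open(path, "rb") as f:
            index = 0
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                yield index, hashlib.md5(chunk).hexdigest(), hashlib.sha1(chunk).hexdigest()
                index += 1

    # The client would send only the (index, md5, sha1) list to the fingerprint server,
    # then upload just the chunks the server reports as not yet stored.
    for index, md5_fp, sha1_fp in chunk_and_fingerprint("example.bin"):  # hypothetical input file
        print(index, md5_fp, sha1_fp)
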
4 EXPERIMENTS

Our experiment platform was set up on six computers. Four computers serve as the cloud storage component, including one NameNode and three DataNodes. The fifth computer serves as the fingerprint server component, and the last one serves as the client component. The detailed configuration is listed in Table 1.

Table 1: Configuration of the six computers.

Hosts         CPU                    RAM    Hard Disk   IP Address
NameNode      Intel(R) Xeon(R)       8GB    120GB
DataNode1     Intel(R) Xeon(R)              120GB
DataNode2     Intel(R) Xeon(R)              120GB
DataNode3     Intel(R) Xeon(R)              120GB
Hash Server   Intel(R) Xeon(R)              120GB
Client        Intel Core(TM)2 Duo           120GB

4.1 File Consistency

In this experiment, we uploaded some files into DDSF, then downloaded them and compared the downloaded files with the source files to check whether the files are consistent after deduplication. The files were built for this purpose and consist of pictures, audio, binary, text and video files. We completed three tests with the data block size set to 1MB. The source file sizes, the space used after deduplication and the file consistency are given in Table 2.

Table 2: File consistency.

Test   Number of Files   Source File Size   Size after Deduplication   File Consistency
1                        MB                 87MB                       Y
2                        MB                 425MB                      Y
3                        GB                 726MB                      Y

According to the results shown in Table 2, the three tests have different numbers of files, file types, file sizes and repetition rates, and all files remain consistent in DDSF. Therefore our framework maintains file consistency and integrity regardless of file type and size.

4.2 Writing Efficiency

In this part, we discuss the writing efficiency with and without DDSF. We used six groups of data sets with data repetition rates of 0%, 20%, 40%, 60%, 80% and 100%, respectively. The size of each group was 2.3GB and the chunk size was 10MB. We uploaded each data set and measured its loading time with and without DDSF. The time it takes to upload files with and without DDSF is shown in Figure 3.

In Figure 3, the data repetition rate is plotted on the x-axis and the time taken to save the files is plotted on the y-axis. As shown in Figure 3, as the data repetition rate increases, the time taken to save files with DDSF decreases rapidly. When the data repetition rate is 5%, the time taken to save the same files with DDSF and without DDSF is about the same.

Figure 3: Time to save files with and without DDSF.

4.3 Reading Efficiency

This part presents the reading efficiency with and without DDSF. We separately downloaded the previously stored data sets and measured each download time. The results are shown in Figure 4. As shown in Figure 4, the average time taken to read files without DDSF is about 253s, and with DDSF it is about 260s; reading with DDSF is only 2.4% slower. As mentioned before, an average of 60% of the data in Dropbox is duplicated. Using DDSF we can therefore save nearly 60% of the storage space; although the time taken to read files increases by 2.4%, the time taken to upload files decreases by 56.7%. In summary, the performance of the entire storage system can be largely improved by using DDSF.

Figure 4: Time to read files with and without DDSF.

4.4 Repetition Rate

In this section, we discuss the relationship among block size, amount of data and data repetition rate. The experimental files were extracted from eight computers in our lab, and the total size of the data is 512GB. The block size has two levels, the KB level and the MB level, and each level is divided into three sub-levels: 8KB, 16KB and 64KB, and 1MB, 2MB and 4MB. The amounts of data are 16GB, 32GB, 64GB, 128GB, 256GB and 512GB. Figure 5 shows the relationship between block size, amount of data, the number of hash table records and the number of repeated records.

As shown in Figure 5-1, for a given amount of data, the number of hash table records increases as the block size decreases; with a block size of 8KB and 512GB of data the hash table holds the most records, while a 4MB block size yields about 10^5 records. Meanwhile, the size of the hash table grows linearly with the amount of data. In Figure 5-2, the number of repeated hash values increases when the amount of data increases or the block size decreases.

The repetition rates are shown in Table 3, where BS denotes Block Size, RR denotes Repetition Rate and AOD denotes Amount of Data. The repetition rate decreases as the block size grows, because larger blocks are less likely to repeat. The repetition rate also decreases as the amount of data increases, because the number of blocks grows faster than the number of repeated hash entries.

Repetition_Rate = NRHE / NB (1)

where NRHE denotes the number of repeated hash entries and NB denotes the number of blocks.

Table 3: Repetition rates (RR) for each block size (BS) and amount of data (AOD).

AOD     8KB     16KB    64KB    1MB     2MB     4MB
16G     38.4%   38.0%   36.9%   34.4%   33.2%   31.3%
32G     28.6%   28.2%   27.3%   25.1%   23.8%   22.6%
64G     22.9%   22.1%   20.8%   16.5%   15.4%   14.2%
128G    17.5%   16.9%   15.2%   12.6%   11.7%   11.0%
256G    15.9%   14.6%   14.1%   13.2%   10.9%   9.4%
512G    14.9%   13.0%   12.8%   11.2%   10.3%   8.7%
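
As a worked example of equation (1), the following sketch computes a repetition rate and the corresponding number of blocks that actually need to be stored; the block counts are invented for illustration and merely reproduce the 14.9% figure from Table 3 (512G, 8KB).

    def repetition_rate(num_repeated_hash_entries, num_blocks):
        """Equation (1): Repetition_Rate = NRHE / NB."""
        return num_repeated_hash_entries / num_blocks

    # Invented numbers for illustration: 1,000,000 blocks of which 149,000 hash
    # values already exist in the table, matching 14.9% as in Table 3 (512G, 8KB).
    rr = repetition_rate(149_000, 1_000_000)
    print(f"repetition rate = {rr:.1%}")                        # 14.9%
    print(f"blocks actually stored = {1_000_000 * (1 - rr):,.0f}")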

Figure 5: Relationship between block size, amount of data, the number of hash table records and the number of repeated records (sub-figures 5-1, 5-2 and 5-3).

4.5 Indexing Efficiency

We now discuss the efficiency of the two-level indexing mechanism. We use five different hash tables obtained in Section 4.4 and measure the time to retrieve them. These hash tables were generated with a chunk size of 8KB and amounts of data of 16GB, 32GB, 64GB, 128GB and 256GB, so the number of records grows with the amount of data. We ran four tests against each hash table, using five thousand hash values per test to query the hash table; the features of these hash values are summarized in Table 4. The results are shown in Figure 6.

Figure 6: The efficiency of indexing.

Table 4: The four tests (number of hash values queried, number hitting the Bloom Filter, number present in the hash table).

Figure 6 shows that the time in test 2 is much greater than in test 1, because searching for hash values that do not exist in the hash table requires scanning an entire slice, which costs more time. The time in test 3 is less than in test 1, because hash values that do not hit in the Bloom Filter need no further search.

5 CONCLUSIONS

In this paper, we propose a deduplicated cloud storage framework, DDSF. The framework is designed to reduce the storage space and bandwidth consumed when uploading data to the cloud. The client component consults the fingerprint server component to determine whether a chunk is duplicated before sending it to the cloud, and it sends only the non-duplicated chunks to the underlying storage system, so it reduces the amount of communication and improves storage space utilization. This paper presents an efficient two-level indexing mechanism to improve the retrieval efficiency of the hash table, employing a Bloom Filter and a hash table partition mechanism. The hash table is divided into multiple parts stored in different directories according to a 16-bit value generated from the hash value. When searching for a hash value, only the slice with the same 16-bit value is loaded into memory, which further accelerates retrieval and improves efficiency. This paper also introduces the system image and operation log.

The system image preserves the directory structure of the entire system, and the operation log records user operations on the directory, so the robustness of the IFS is enhanced. The retrieval of the hash table remains the system bottleneck. In future work, we will develop a more efficient indexing mechanism to reduce the indexing time and further optimize performance.

ACKNOWLEDGEMENTS

This research is supported by the project of the Science and Technology Department of Shaanxi Province and by the National High-tech R&D Program of China (863 Program) under Grant No. 2009AA01Z142.

REFERENCES

Nicolas Bonvin, A self-organized, fault-tolerant and scalable replication scheme for cloud storage. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC'10). New York: ACM Press, 2010.

C. Soghoian, How Dropbox sacrifices user privacy for cost savings. 04/how-dropbox-sacrifices-userprivacy-for.html, April 2011, online.

Maddodi, S., Data Deduplication Techniques and Analysis. In Proceedings of the 3rd International Conference on Emerging Trends in Engineering and Technology (ICETET). New Jersey: IEEE Computer Society Press, 2010.

V. Henson, An analysis of compare-by-hash. In Proceedings of the 9th Conference on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.

J. Black, Compare-by-hash: A reasoned analysis. In Proceedings of the 2006 USENIX Annual Technical Conference, 2006.

D. Borthakur, The Hadoop Distributed File System: Architecture and Design, 2007. URL: apache.org/hdfs/docs/current/hdfs_design.pdf.

Benjamin Zhu, Avoiding the disk bottleneck in the Data Domain deduplication file system. In Proc. of the 6th USENIX Conference on File and Storage Technologies (FAST 2008). Berkeley: USENIX Association, 2008.

Broder, A. Z., Mitzenmacher, M., Network applications of Bloom filters: A survey. Internet Mathematics, 1(4), 2003.

Q. Sean, D. Sean, Venti: A New Approach to Archival Data Storage. In Proceedings of the 1st USENIX Conference on File and Storage Technologies, Monterey, CA: USENIX Association, 2002.

D. Bhagwat, Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup. In 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2009.

T. C. Austin, Decentralized deduplication in SAN cluster file systems. In Proceedings of the 2009 USENIX Annual Technical Conference, San Diego, California, 2009.

Zhe Sun, DeDu: Building a Deduplication Storage System over Cloud Computing. In Proceedings of the International Conference on Computer Supported Cooperative Work in Design (CSCWD), 2011.

B. Bloom, Space/Time Tradeoffs in Hash Coding with Allowable Errors. Communications of the ACM, 13(7), 1970.


D DAVID PUBLISHING. Big Data; Definition and Challenges. 1. Introduction. Shirin Abbasi Journal of Energy and Power Engineering 10 (2016) 405-410 doi: 10.17265/1934-8975/2016.07.004 D DAVID PUBLISHING Shirin Abbasi Computer Department, Islamic Azad University-Tehran Center Branch, Tehran

More information

Strategies for Single Instance Storage. Michael Fahey Hitachi Data Systems

Strategies for Single Instance Storage. Michael Fahey Hitachi Data Systems Strategies for Single Instance Storage Michael Fahey Hitachi Data Systems Abstract Single Instance Strategies for Storage Single Instance Storage has become a very popular topic in the industry because

More information

HTRC Data API Performance Study

HTRC Data API Performance Study HTRC Data API Performance Study Yiming Sun, Beth Plale, Jiaan Zeng Amazon Indiana University Bloomington {plale, jiaazeng}@cs.indiana.edu Abstract HathiTrust Research Center (HTRC) allows users to access

More information

HYDRAstor: a Scalable Secondary Storage

HYDRAstor: a Scalable Secondary Storage HYDRAstor: a Scalable Secondary Storage 7th TF-Storage Meeting September 9 th 00 Łukasz Heldt Largest Japanese IT company $4 Billion in annual revenue 4,000 staff www.nec.com Polish R&D company 50 engineers

More information

Cloud Computing. Hwajung Lee. Key Reference: Prof. Jong-Moon Chung s Lecture Notes at Yonsei University

Cloud Computing. Hwajung Lee. Key Reference: Prof. Jong-Moon Chung s Lecture Notes at Yonsei University Cloud Computing Hwajung Lee Key Reference: Prof. Jong-Moon Chung s Lecture Notes at Yonsei University Cloud Computing Cloud Introduction Cloud Service Model Big Data Hadoop MapReduce HDFS (Hadoop Distributed

More information

IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 4, NO. X, XXXXX Boafft: Distributed Deduplication for Big Data Storage in the Cloud

IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 4, NO. X, XXXXX Boafft: Distributed Deduplication for Big Data Storage in the Cloud TRANSACTIONS ON CLOUD COMPUTING, VOL. 4, NO. X, XXXXX 2016 1 Boafft: Distributed Deduplication for Big Data Storage in the Cloud Shengmei Luo, Guangyan Zhang, Chengwen Wu, Samee U. Khan, Senior Member,,

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google SOSP 03, October 19 22, 2003, New York, USA Hyeon-Gyu Lee, and Yeong-Jae Woo Memory & Storage Architecture Lab. School

More information

International Journal of Advance Engineering and Research Development. Duplicate File Searcher and Remover

International Journal of Advance Engineering and Research Development. Duplicate File Searcher and Remover Scientific Journal of Impact Factor (SJIF): 4.72 International Journal of Advance Engineering and Research Development Volume 4, Issue 4, April -2017 Duplicate File Searcher and Remover Remover Of Duplicate

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

HP Data Protector 9.0 Deduplication

HP Data Protector 9.0 Deduplication Technical white paper HP Data Protector 9.0 Deduplication Introducing Backup to Disk devices and deduplication Table of contents Summary 3 Overview 3 When to use deduplication 4 Advantages of B2D devices

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part

More information

Cisco Tetration Analytics Platform: A Dive into Blazing Fast Deep Storage

Cisco Tetration Analytics Platform: A Dive into Blazing Fast Deep Storage White Paper Cisco Tetration Analytics Platform: A Dive into Blazing Fast Deep Storage What You Will Learn A Cisco Tetration Analytics appliance bundles computing, networking, and storage resources in one

More information

MAPR TECHNOLOGIES, INC. TECHNICAL BRIEF APRIL 2017 MAPR SNAPSHOTS

MAPR TECHNOLOGIES, INC. TECHNICAL BRIEF APRIL 2017 MAPR SNAPSHOTS MAPR TECHNOLOGIES, INC. TECHNICAL BRIEF APRIL 2017 MAPR SNAPSHOTS INTRODUCTION The ability to create and manage snapshots is an essential feature expected from enterprise-grade storage systems. This capability

More information