DDSF: A Data Deduplication System Framework for Cloud Environments

Jianhua Gu, Chuang Zhang and Wenwei Zhang
School of Computer Science and Technology, High Performance Computing R&D Center, Northwestern Polytechnical University, Xi'an, China
gujh@nwpu.edu.cn, pers@mail.nwpu.edu.cn

Keywords: Cloud Storage, Data Deduplication, Hash Table Partition, Index File System.

Abstract: Cloud storage has been widely used because it provides seemingly unlimited storage space and flexible access, but the rising cost of storage and communication is an issue. In this paper, we propose a Data Deduplication System Framework (DDSF) for cloud storage environments. The DDSF consists of three major components: the client, the fingerprint server and the storage component. The client component divides a file into chunks and calculates the hash value of each chunk, and it sends to the storage component only those chunks whose hash values do not yet exist in the fingerprint server component, which reduces the consumption of communication bandwidth and storage space. We developed an Index File System (IFS) to manage the metadata of user files in the fingerprint server. The fingerprint server component maintains a hash table containing the hash values of the chunks already stored in the storage component. This paper presents a two-level indexing mechanism to reduce both the spatial and temporal overhead of accessing the hash table: the first level employs a Bloom Filter and the second level uses a hash table partition mechanism. In order to reduce the possibility of accidental collisions of data block hash values, we use two hash algorithms to calculate the hash value of each chunk. We further conduct several experiments on our framework. The results demonstrate that the proposed framework can improve storage space utilization and reduce communication consumption.

1 INTRODUCTION

Cloud storage has become an effective way for people to store massive data, due to its high reliability and high scalability (Bonvin, 2010). However, as the quantity of data increases, the costs of storage and communication increase rapidly, so it is important to seek an effective storage method that improves the utilization of storage space and accelerates storing. According to (Soghoian, 2011), an average of 60% of the data in Dropbox may be deduplicated. Duplicated data introduces extra storage space and communication overhead. Data deduplication is a technique for eliminating duplicate copies of repeating data; it is used to improve storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent. Data deduplication can detect and remove duplicated data at two different levels, the file level or the block level (Maddodi, 2010). Cross-user deduplication is used in practice to maximize the benefits of data deduplication: it identifies redundant data among different users, removes the redundancy and saves just a single copy of the duplicate data. Data deduplication in a cloud storage system can therefore achieve higher storage efficiency, saving costs for cloud storage providers, and can also reduce storage and communication costs for the consumers who use the cloud storage system.

This paper presents a Data Deduplication System Framework for cloud storage (DDSF), which adds a deduplication mechanism to a common cloud storage system to reduce data redundancy without changing the common cloud storage framework.
It can solve the problem of low utilization of storage space in common cloud storage. The DDSF is composed of three components: the client component, the fingerprint server component and the storage component. The fingerprint server component is responsible for storing and retrieving the hash value of each data block, called its fingerprint. Before storing a file, the client component divides the file into blocks and calculates the hash value of each block.

By communicating with the fingerprint server component, the client component can learn which blocks already exist in the storage component, and it then transmits only the blocks that are not yet stored there. This effectively reduces the bandwidth consumption of communication. In DDSF, we use two hash algorithms to calculate the hash value of each data block, which greatly reduces the possibility of accidental collisions (Henson, 2003) (Black, 2006). The size of the hash table grows as the amount of data stored in the cloud grows, so retrieving entries from the hash table becomes inefficient. We design a two-level indexing mechanism for the hash table to speed up its retrieval. We also develop a file system, named IFS (Index File System). The IFS is the key component for metadata management and access. The metadata is the information about a user's file, namely the position of each block's hash value in the hash table. The IFS uses a tree structure to organize directories. The content of a file in IFS is no longer the real data of the user file, but the addresses in the hash table of all chunks belonging to that file. We use a system image, which stores a copy of the IFS, and a log file of operations on the IFS to enhance the persistence and robustness of the IFS. In our implementation of DDSF, we use HDFS (Hadoop Distributed File System) as the storage system (Borthakur, 2007). HDFS is a distributed file system that runs on commodity hardware and was developed by the Apache Software Foundation for managing massive data.

The rest of this paper is organized as follows. Related works are discussed in Section 2. The proposed system framework and its main components are discussed in detail in Section 3. Experiments are presented in Section 4. Section 5 concludes the paper.

2 RELATED WORKS

As the amount of data stored in the cloud storage system increases, the size of the hash table also increases, so efficiently searching the hash table has become a key issue of data deduplication. In (Zhu, 2008), the authors described the Data Domain deduplication system, which proposed using multiple indexes and searching the entire index table step by step to improve efficiency, but at a much higher storage space overhead. In this paper, we develop a two-level index mechanism to reduce both the spatial and temporal overhead. The first-level index is based on the Bloom Filter (Broder, 2003) (Bloom, 1970), and the second-level index uses a partition mechanism which divides the hash table into pieces of suitable size. The Bloom Filter uses memory space efficiently. It maintains a binary vector data structure in memory and can quickly test whether a certain element is a member of a set. False positive matches are possible, but false negatives are not; i.e. a query returns either "definitely not in set" or "possibly in set". In the first case, we can immediately conclude that the element does not exist, and in the second case, we do a further search in the second level.
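
As a concrete illustration of this first-level test, the following minimal Python sketch shows a Bloom Filter with the "no false negatives, possible false positives" behaviour described above; the bit-vector size, the choice of two hash functions and all names are illustrative assumptions, not the structure used in DDSF.

    import hashlib

    class BloomFilter:
        """Minimal Bloom Filter: no false negatives, occasional false positives."""

        def __init__(self, num_bits=2**16):
            self.num_bits = num_bits
            self.bits = bytearray(num_bits // 8)   # bit vector kept in memory

        def _positions(self, item):
            # Derive two bit positions from independent digests of the item.
            for algo in (hashlib.md5, hashlib.sha1):
                digest = algo(item).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item):
            # False means "definitely not in the set"; True means "possibly in the set".
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    bf = BloomFilter()
    bf.add(b"fingerprint-of-chunk-A")
    assert bf.might_contain(b"fingerprint-of-chunk-A")    # never a false negative
    print(bf.might_contain(b"fingerprint-of-chunk-B"))    # usually False; could be a false positive
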
Several deduplication storage systems have been designed previously, including Venti (Sean, 2002), Extreme Binning (Bhagwat, 2009), DeDe (Austin, 2009) and DeDu (Sun, 2011). Venti is a network-based data archiving system that uses unique hash values to identify data blocks and an append-only way to store data. It lacks support for deleting data, so it is not suitable for the cloud storage environment. Extreme Binning stores each entire file on a single backup node. Each backup node is autonomous and manages its own index table and data, without sharing information with or knowing about other backup nodes. It deduplicates only within one backup node, so it is not suitable for global deduplication of massive data. DeDe is a block-level deduplication cluster file system without central coordination; however, it is only suitable for deduplicating virtual machine images in a virtual machine environment, not for a general data storage system. DeDu is a file-level deduplication system in which whole-file similarity detection is done in the client, but since the repetition rate at the file level is generally not very high, its deduplication rate is relatively low.

In summary, this paper proposes a block-level, online deduplication framework. File chunking and data block fingerprint calculation are placed in the client, which reduces the computational pressure on the server. Moreover, the framework identifies duplicate data before data transmission, so it reduces the amount of data transmitted and the bandwidth consumed.

3 FRAMEWORK

The proposed framework, named DDSF, is shown in Figure 1. The DDSF is composed of three components: the client component, the fingerprint server component and the storage component.

Client Component: In this component, a file is divided into blocks (chunks) and the hash value of every block is calculated; these hash values are then sent to the fingerprint server component, which tests whether each data block already exists in the storage component and returns the results to the client component. The client component then decides, based on these results, whether to transmit each data block to the storage component. When a user needs to download a file, the chunk merging module downloads all the chunks of the file directly from the storage component and merges them into a file.

Fingerprint Server Component: This component contains a file system, named IFS, with two main functions: metadata management and hash table management. The metadata management mainly manages the metadata of users' files and the operations on those files. The hash table management maintains the hash table of all blocks and provides efficient access to it. Each hash table entry contains the block's hash value, the actual address of the data block in the storage system, the repetition count of the block and so on.

Storage Component: We use HDFS as the storage system; the data blocks of user files are actually stored here.

DDSF realizes online deduplication. By interacting with the fingerprint server component, the client component can know which data blocks are already stored in the storage component before sending the data, so it saves storage space and reduces bandwidth consumption by sending only the non-duplicated blocks to HDFS. The client component and the fingerprint server component are the main functional components of the framework.

Figure 1: Framework Architecture.

3.1 Fingerprint Server Component

3.1.1 Index File System

The main function of this component is achieved through the IFS, a file system. The IFS is the key component for managing and accessing the metadata of user files. The metadata includes the file's attributes and the addresses of the file's data blocks, the latter being pointers to entries of the hash table. For each user file, the IFS creates a corresponding file in IFS, the shadow file or metadata file. Instead of the real content of a user file, the metadata of the user file is stored in the shadow file.

When the fingerprint server component receives the command from the client component to store a new file in the storage component, it creates a new metadata file in IFS according to the file name sent by the client component. Then the fingerprint server component tests whether the hash value of each chunk from the client component already exists in the hash table. If so, it increases the counter of the hash table entry by 1 and stores the address of this entry in the metadata file. If not, it inserts a new entry containing the hash value and the address of this block in the storage component into the hash table, and then stores the address of the new entry in the metadata file. When the fingerprint server component receives the command from the client component to download a file from the storage component, it gets the addresses of the hash table entries from the metadata file. Using these addresses, the fingerprint server component gets the addresses of the chunks stored in the storage component from the hash table and sends them to the client component, which then downloads the chunks directly from the storage component according to these addresses.

The IFS realizes the metadata management and hash table management functions. The IFS uses a tree structure to manage the directory of metadata files. It creates a child node for each user on the tree, and each user's directory structure is stored in a sub-tree rooted at that node. Each user can only access the files in his/her own directory. The hash table management is the key part of the IFS.
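
The store and download paths described above can be pictured with the following sketch; the in-memory dictionaries standing in for the IFS metadata files and the hash table, and all function names, are simplifying assumptions, not the DDSF implementation.

    # Hypothetical in-memory stand-ins for the IFS metadata files and the hash table;
    # the real DDSF persists these on the fingerprint server, so names and structures
    # here are illustrative only.
    hash_table = {}   # fingerprint -> {"address": block location in HDFS, "refcount": int}
    metadata = {}     # user file name -> ordered list of fingerprints (hash table entries)

    def record_chunk(file_name, fingerprint, address_if_new=None):
        """Store path, called once per chunk hash sent by the client: reuse the
        existing entry (and bump its reference count) or insert a new one.
        Returns True when the client still has to upload the chunk to HDFS."""
        entries = metadata.setdefault(file_name, [])
        entries.append(fingerprint)
        entry = hash_table.get(fingerprint)
        if entry is not None:
            entry["refcount"] += 1          # duplicate chunk: count it, no upload needed
            return False
        hash_table[fingerprint] = {"address": address_if_new, "refcount": 1}
        return True                         # new chunk: client uploads it, address recorded

    def resolve_file(file_name):
        """Download path: map each metadata entry back to the chunk address in HDFS."""
        return [hash_table[fp]["address"] for fp in metadata[file_name]]

    if record_chunk("report.docx", "md5-of-chunk-0", address_if_new="hdfs://block/123"):
        pass  # the client would now transmit the chunk to the storage component
    print(resolve_file("report.docx"))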

3.1.2 Two-level Index Mechanism

The IFS maintains a hash table. Each record of the hash table stores one data block's address in the storage component, the hash value of the block and the repetition count of the block. As the number of chunks stored in the system increases, both storing and retrieving the hash table become complex issues. In order to speed up access to the hash table and to decrease the amount of storage space, we use a Bloom Filter and a hash table partition technique. The two-level index mechanism is shown in Figure 2. The first-level index is handled by the Bloom Filter, and the second-level index is handled by searching a slice of the hash table. The pseudocode for searching a fingerprint with the two-level index mechanism is shown in Algorithm 1.

Algorithm 1: Searching a fingerprint.
Bool Search_fingerprint(fingerprint)
Begin
    16_bit_fp = To_16_bit(fingerprint);
    if (hit_in_bloom_filter(16_bit_fp))
        if (found_in_slice(16_bit_fp))
            return TRUE;
        else
            return FALSE;
    else
        return FALSE;
End

Figure 2: The two-level index mechanism.

The Bloom Filter table can reside in memory, so it has high query efficiency. A bit in the Bloom Filter represents whether a hash value exists in the hash table. We use the hash value calculated by the MD5 algorithm as the main index value, and this hash value is 128 bits long. Obviously, if we used all values generated by MD5 to create the Bloom Filter table, its size would be 2^128 bits, which is far too big to load into memory. So we design a compressed hash value indexing technique: it divides the 128-bit hash value into eight 16-bit parts and then applies an Exclusive OR operation to these 8 parts to obtain a new 16-bit hash value, which is used to index the Bloom Filter table. The Bloom Filter table size is then 2^16 (64K) bits, so it can reside in memory.

Once the fingerprint server component receives a hash value sent by the client component, it tests the status of the corresponding bit in the Bloom Filter table. If the bit is not set, the chunk is not a duplicate; the client component then stores this data block into HDFS directly, the fingerprint server component adds the location of the data block in HDFS and its hash value to the hash table as a new entry, and finally it sets the bit in the Bloom Filter according to the hash value. If the bit is set, the second-level index operation is performed.

In the second-level index, the hash table is divided into multiple parts and each part is stored as a file in a different directory of the IFS. The directories are created according to the 16-bit value produced by the Exclusive OR operation, and each possible value corresponds to one directory; all hash values stored in the same directory share the same 16-bit value. During indexing, the system locates the directory that matches the 16-bit value generated from a hash value and loads the slice in that directory into memory, which bounds the size of the hash table held in memory while preserving indexing performance. A hash value is judged to be new if it does not exist in the slice; in that case the data block is stored into HDFS and the storage address and hash value are added to this slice. The hash table in each directory can be divided into several fixed-size slices according to the number of records, and each slice is a hash sub-table. To prevent a single hash table from becoming too large, we set a maximum record number, such as 16M; when a hash table exceeds this maximum, the system creates a new hash table. This ensures the appropriate size of the hash table.
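
Under the assumptions that the fingerprint is an MD5 hex string and that each slice can be represented by an in-memory set, the 16-bit folding and the two-level lookup of Algorithm 1 could be sketched as follows; the real system keeps the slices as files in per-value directories, so this is only an illustration.

    import hashlib

    def to_16_bit(md5_hex):
        """Fold the 128-bit MD5 value into 16 bits by XOR-ing its eight 16-bit parts."""
        value = int(md5_hex, 16)
        folded = 0
        for _ in range(8):
            folded ^= value & 0xFFFF
            value >>= 16
        return folded

    bloom_bits = bytearray(2**16 // 8)   # first level: one bit per possible 16-bit value
    slices = {}                          # second level: 16-bit value -> fingerprints in that slice

    def search_fingerprint(md5_hex):
        """Two-level lookup: Bloom Filter first, then only the matching slice."""
        fp16 = to_16_bit(md5_hex)
        if not bloom_bits[fp16 // 8] & (1 << (fp16 % 8)):
            return False                               # definitely not stored yet
        return md5_hex in slices.get(fp16, set())      # search the slice for the full fingerprint

    def insert_fingerprint(md5_hex):
        fp16 = to_16_bit(md5_hex)
        bloom_bits[fp16 // 8] |= 1 << (fp16 % 8)
        slices.setdefault(fp16, set()).add(md5_hex)

    fp = hashlib.md5(b"some chunk data").hexdigest()
    print(search_fingerprint(fp))    # False: first upload of this chunk
    insert_fingerprint(fp)
    print(search_fingerprint(fp))    # True: the chunk is now a duplicate
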
3.1.3 System Image and Operation Log

We develop a system image and a log file of operations on the IFS to enhance its robustness and allow it to recover from failures. The system image persistently stores the structure of the file directory held in memory, and the operation log file records users' operation information on the IFS.

When the system starts, it reads the system image and the operation log from disk, combines the two files to generate a new system image, and at the same time rebuilds the directory structure of the IFS in memory. Finally it deletes the old operation log and creates a new blank log file. The system periodically stores the in-memory directory structure into the system image and updates the operation log file.

To handle concurrent access to the operation log, we set up two buffers and allow only one user to access a buffer at any moment. When a user accesses the first buffer to record operation information, the buffer is locked; the system then moves the operation information from the first buffer to the second buffer and unlocks the first buffer. After the first buffer is unlocked, another user can access it, while the second buffer writes the information to the log file. Because the I/O operation is performed on the second buffer, the next user can write to the operation log as soon as the previous user leaves the first buffer, which greatly improves the response speed when multiple users write to the log file.
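
A minimal sketch of this double-buffered log, assuming Python threads and a plain text log file, might look as follows; the locking policy shown is one possible reading of the description above, not the DDSF code.

    import threading

    class DoubleBufferedLog:
        """Operation log with two buffers: users record operations in the first
        buffer under a briefly held lock, while disk I/O drains the second buffer
        so the next user is not blocked by the write to the log file."""

        def __init__(self, path):
            self.path = path
            self.front = []                        # first buffer: users append here
            self.back = []                         # second buffer: owned by the flusher
            self.front_lock = threading.Lock()     # one user in the first buffer at a time
            self.io_lock = threading.Lock()        # one flusher performs file I/O at a time

        def append(self, record):
            with self.front_lock:                  # lock held only for an in-memory append
                self.front.append(record)
            self._flush()

        def _flush(self):
            with self.io_lock:
                with self.front_lock:              # brief swap: hand records to buffer two
                    self.front, self.back = [], self.front
                with open(self.path, "a") as log:  # slow disk write, first buffer already free
                    log.writelines(r + "\n" for r in self.back)
                self.back.clear()

    log = DoubleBufferedLog("ifs_operations.log")
    log.append("user1 CREATE /user1/report.docx")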

3.2 Client Component

The client component is the tool through which users access the cloud storage system. It consists of a file chunking module and a chunk merging module.

File Chunking Module: In this module, a file is broken into chunks of fixed or variable size, and the hash values of each chunk are calculated using both the MD5 and SHA-1 algorithms. The client then sends the store-file command and the list of chunk hash values to the fingerprint server component. The fingerprint server component creates a new file in the IFS to store the metadata of the user file and searches the hash table to determine, from the list of hash values, which chunks already exist. The fingerprint server component informs the client component to send the chunks that are not duplicated directly to HDFS.

Chunk Merging Module: When a user wants to download a file from the cloud, this module consults the fingerprint server component to obtain the addresses of all chunks of the file, then accesses HDFS directly to fetch the chunks block by block according to these addresses, and finally merges the chunks into a file.

In all, the client component can know whether a chunk is duplicated before transmitting it and sends only the non-duplicated chunks, which reduces traffic and bandwidth consumption.
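
A fixed-size chunker that produces the MD5/SHA-1 fingerprint pair for each chunk might look like the following sketch; the 1MB chunk size, the file name and the returned tuple format are illustrative assumptions, not the DDSF wire format.

    import hashlib

    def chunk_and_fingerprint(path, chunk_size=1024 * 1024):
        """Split a file into fixed-size chunks and fingerprint each chunk with both
        MD5 and SHA-1, so an accidental collision in one hash is caught by the other."""
        with open(path, "rb") as f:
            index = 0
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                yield index, hashlib.md5(chunk).hexdigest(), hashlib.sha1(chunk).hexdigest()
                index += 1

    # The client would send only the (index, md5, sha1) list to the fingerprint server,
    # then upload just the chunks the server reports as not yet stored.
    for index, md5_fp, sha1_fp in chunk_and_fingerprint("example.bin"):  # hypothetical input file
        print(index, md5_fp, sha1_fp)
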
4 EXPERIMENTS

Our experiment platform was set up on six computers. Four computers serve as the cloud storage component, including one NameNode and three DataNodes. The fifth computer serves as the fingerprint server component, and the last one serves as the client component. The detailed configuration is listed in Table 1.

Table 1: Configuration of the six computers.

Hosts         CPU                    RAM    Hard Disk   IP Address
NameNode      Intel(R) Xeon(R)       8GB    120GB
DataNode1     Intel(R) Xeon(R)              120GB
DataNode2     Intel(R) Xeon(R)              120GB
DataNode3     Intel(R) Xeon(R)              120GB
Hash Server   Intel(R) Xeon(R)              120GB
Client        Intel Core(TM)2 Duo           120GB

4.1 File Consistency

In this experiment, we uploaded some files into DDSF, then downloaded them and compared the downloaded files with the source files to check whether the files are consistent after deduplication. The files were built for this purpose and consist of pictures, audio, binary, text and video files. We completed three tests with the data block size set to 1MB. The source file sizes, the space used after deduplication and the file consistency are given in Table 2.

Table 2: File consistency.

Test   Number of Files   Source File Size   Size after Deduplication   File Consistency
1                        MB                 87MB                       Y
2                        MB                 425MB                      Y
3                        GB                 726MB                      Y

According to the results shown in Table 2, the three tests have different numbers of files, file types, file sizes and repetition rates, and all files remain consistent in DDSF. Therefore our framework maintains file consistency and integrity regardless of file type and size.

4.2 Writing Efficiency

In this part, we discuss the writing efficiency with and without DDSF. We used six groups of data sets with data repetition rates of 0%, 20%, 40%, 60%, 80% and 100%, respectively. The size of each group was 2.3GB and the chunk size was 10MB. We uploaded each data set and measured its loading time with and without DDSF. The time it takes to upload files with and without DDSF is shown in Figure 3.

In Figure 3, the data repetition rate is plotted on the x-axis and the time taken to save the files is plotted on the y-axis. As shown in Figure 3, as the data repetition rate increases, the time taken to save files with DDSF decreases rapidly. When the data repetition rate is 5%, the time taken to save the same files with DDSF and without DDSF is about the same.

Figure 3: Time to save files with and without DDSF.

4.3 Reading Efficiency

This part presents the reading efficiency with and without DDSF. We separately downloaded the previously stored data sets and measured each download time. The results are shown in Figure 4. As shown in Figure 4, the average time taken to read files without DDSF is about 253s, and with DDSF it is about 260s; reading with DDSF is only 2.4% slower. As mentioned before, an average of 60% of the data in Dropbox is duplicated. Using DDSF we can therefore save nearly 60% of the storage space; although the time taken to read files increases by 2.4%, the time taken to upload files decreases by 56.7%. In summary, the performance of the entire storage system can be largely improved by using DDSF.

Figure 4: Time to read files with and without DDSF.

4.4 Repetition Rate

In this section, we discuss the relationship among block size, amount of data and data repetition rate. The experimental files were extracted from eight computers in our lab, and the total size of the data is 512GB. The block size has two levels, the KB level and the MB level, and each level is divided into three sub-levels: 8KB, 16KB and 64KB, and 1MB, 2MB and 4MB. The amounts of data are 16GB, 32GB, 64GB, 128GB, 256GB and 512GB. Figure 5 shows the relationship between block size, amount of data, the number of hash table records and the number of repeated records.

As shown in Figure 5-1, for a given amount of data, the number of hash table records increases as the block size decreases; with a block size of 8KB and 512GB of data the hash table holds the most records, while a 4MB block size yields about 10^5 records. Meanwhile, the size of the hash table grows linearly with the amount of data. In Figure 5-2, the number of repeated hash values increases when the amount of data increases or the block size decreases.

The repetition rates are shown in Table 3, where BS denotes Block Size, RR denotes Repetition Rate and AOD denotes Amount of Data. The repetition rate decreases as the block size grows, because larger blocks are less likely to repeat. The repetition rate also decreases as the amount of data increases, because the number of blocks grows faster than the number of repeated hash entries.

Repetition_Rate = NRHE / NB (1)

where NRHE denotes the number of repeated hash entries and NB denotes the number of blocks.

Table 3: Repetition rates (RR) for each block size (BS) and amount of data (AOD).

AOD     8KB     16KB    64KB    1MB     2MB     4MB
16G     38.4%   38.0%   36.9%   34.4%   33.2%   31.3%
32G     28.6%   28.2%   27.3%   25.1%   23.8%   22.6%
64G     22.9%   22.1%   20.8%   16.5%   15.4%   14.2%
128G    17.5%   16.9%   15.2%   12.6%   11.7%   11.0%
256G    15.9%   14.6%   14.1%   13.2%   10.9%   9.4%
512G    14.9%   13.0%   12.8%   11.2%   10.3%   8.7%
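
As a worked example of equation (1), the following sketch computes a repetition rate and the corresponding number of blocks that actually need to be stored; the block counts are invented for illustration and merely reproduce the 14.9% figure from Table 3 (512G, 8KB).

    def repetition_rate(num_repeated_hash_entries, num_blocks):
        """Equation (1): Repetition_Rate = NRHE / NB."""
        return num_repeated_hash_entries / num_blocks

    # Invented numbers for illustration: 1,000,000 blocks of which 149,000 hash
    # values already exist in the table, matching 14.9% as in Table 3 (512G, 8KB).
    rr = repetition_rate(149_000, 1_000_000)
    print(f"repetition rate = {rr:.1%}")                        # 14.9%
    print(f"blocks actually stored = {1_000_000 * (1 - rr):,.0f}")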

Figure 5: Relationship between block size, amount of data, the number of hash table records and the number of repeated records (sub-figures 5-1, 5-2 and 5-3).

4.5 Indexing Efficiency

We now discuss the efficiency of the two-level indexing mechanism. We use five different hash tables obtained in Section 4.4 and measure the time to retrieve them. These hash tables were generated with a chunk size of 8KB and amounts of data of 16GB, 32GB, 64GB, 128GB and 256GB, so the number of records grows with the amount of data. We ran four tests against each hash table, using five thousand hash values per test to query the hash table; the features of these hash values are summarized in Table 4. The results are shown in Figure 6.

Figure 6: The efficiency of indexing.

Table 4: The four tests (number of hash values queried, number hitting the Bloom Filter, number present in the hash table).

Figure 6 shows that the time in test 2 is much greater than in test 1, because searching for hash values that do not exist in the hash table requires scanning an entire slice, which costs more time. The time in test 3 is less than in test 1, because hash values that do not hit in the Bloom Filter need no further search.

5 CONCLUSIONS

In this paper, we propose a deduplicated cloud storage framework, DDSF. The framework is designed to reduce the storage space and bandwidth consumed when uploading data to the cloud. The client component consults the fingerprint server component to determine whether a chunk is duplicated before sending it to the cloud, and it sends only the non-duplicated chunks to the underlying storage system, so it reduces the amount of communication and improves storage space utilization. This paper presents an efficient two-level indexing mechanism to improve the retrieval efficiency of the hash table, employing a Bloom Filter and a hash table partition mechanism. The hash table is divided into multiple parts stored in different directories according to a 16-bit value generated from the hash value. When searching for a hash value, only the slice with the same 16-bit value is loaded into memory, which further accelerates retrieval and improves efficiency. This paper also introduces the system image and operation log.

The system image preserves the directory structure of the entire system, and the operation log records user operations on the directory, so the robustness of the IFS is enhanced. The retrieval of the hash table remains the system bottleneck. In future work, we will develop a more efficient indexing mechanism to reduce the indexing time and further optimize performance.

ACKNOWLEDGEMENTS

This research is supported by the project of the Science and Technology Department of Shaanxi Province and by the National High-tech R&D Program of China (863 Program) under Grant No. 2009AA01Z142.

REFERENCES

Nicolas Bonvin, A self-organized, fault-tolerant and scalable replication scheme for cloud storage. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC'10). New York: ACM Press, 2010.

C. Soghoian, How Dropbox sacrifices user privacy for cost savings. 04/how-dropbox-sacrifices-userprivacy-for.html, April 2011, online.

Maddodi, S., Data Deduplication Techniques and Analysis. In Proceedings of the 3rd International Conference on Emerging Trends in Engineering and Technology (ICETET). New Jersey: IEEE Computer Society Press, 2010.

V. Henson, An analysis of compare-by-hash. In Proceedings of the 9th Conference on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.

J. Black, Compare-by-hash: A reasoned analysis. In Proceedings of the 2006 USENIX Annual Technical Conference, 2006.

D. Borthakur, The Hadoop Distributed File System: Architecture and Design, 2007. URL: apache.org/hdfs/docs/current/hdfs_design.pdf.

Benjamin Zhu, Avoiding the disk bottleneck in the Data Domain deduplication file system. In Proc. of the 6th USENIX Conference on File and Storage Technologies (FAST 2008). Berkeley: USENIX Association, 2008.

Broder, A. Z., Mitzenmacher, M., Network applications of Bloom filters: A survey. Internet Mathematics, 1(4), 2003.

Q. Sean, D. Sean, Venti: A New Approach to Archival Data Storage. In Proceedings of the 1st USENIX Conference on File and Storage Technologies, Monterey, CA: USENIX Association, 2002.

D. Bhagwat, Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup. In 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2009.

T. C. Austin, Decentralized deduplication in SAN cluster file systems. In Proceedings of the 2009 USENIX Annual Technical Conference, San Diego, California, 2009.

Zhe Sun, DeDu: Building a Deduplication Storage System over Cloud Computing. In Proceedings of the International Conference on Computer Supported Cooperative Work in Design (CSCWD), 2011.

B. Bloom, Space/Time Tradeoffs in Hash Coding with Allowable Errors. Communications of the ACM, 13(7), 1970.


D DAVID PUBLISHING. Big Data; Definition and Challenges. 1. Introduction. Shirin Abbasi Journal of Energy and Power Engineering 10 (2016) 405-410 doi: 10.17265/1934-8975/2016.07.004 D DAVID PUBLISHING Shirin Abbasi Computer Department, Islamic Azad University-Tehran Center Branch, Tehran

More information

Strategies for Single Instance Storage. Michael Fahey Hitachi Data Systems

Strategies for Single Instance Storage. Michael Fahey Hitachi Data Systems Strategies for Single Instance Storage Michael Fahey Hitachi Data Systems Abstract Single Instance Strategies for Storage Single Instance Storage has become a very popular topic in the industry because

More information

HTRC Data API Performance Study

HTRC Data API Performance Study HTRC Data API Performance Study Yiming Sun, Beth Plale, Jiaan Zeng Amazon Indiana University Bloomington {plale, jiaazeng}@cs.indiana.edu Abstract HathiTrust Research Center (HTRC) allows users to access

More information

HYDRAstor: a Scalable Secondary Storage

HYDRAstor: a Scalable Secondary Storage HYDRAstor: a Scalable Secondary Storage 7th TF-Storage Meeting September 9 th 00 Łukasz Heldt Largest Japanese IT company $4 Billion in annual revenue 4,000 staff www.nec.com Polish R&D company 50 engineers

More information

Cloud Computing. Hwajung Lee. Key Reference: Prof. Jong-Moon Chung s Lecture Notes at Yonsei University

Cloud Computing. Hwajung Lee. Key Reference: Prof. Jong-Moon Chung s Lecture Notes at Yonsei University Cloud Computing Hwajung Lee Key Reference: Prof. Jong-Moon Chung s Lecture Notes at Yonsei University Cloud Computing Cloud Introduction Cloud Service Model Big Data Hadoop MapReduce HDFS (Hadoop Distributed

More information

IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 4, NO. X, XXXXX Boafft: Distributed Deduplication for Big Data Storage in the Cloud

IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 4, NO. X, XXXXX Boafft: Distributed Deduplication for Big Data Storage in the Cloud TRANSACTIONS ON CLOUD COMPUTING, VOL. 4, NO. X, XXXXX 2016 1 Boafft: Distributed Deduplication for Big Data Storage in the Cloud Shengmei Luo, Guangyan Zhang, Chengwen Wu, Samee U. Khan, Senior Member,,

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google SOSP 03, October 19 22, 2003, New York, USA Hyeon-Gyu Lee, and Yeong-Jae Woo Memory & Storage Architecture Lab. School

More information

International Journal of Advance Engineering and Research Development. Duplicate File Searcher and Remover

International Journal of Advance Engineering and Research Development. Duplicate File Searcher and Remover Scientific Journal of Impact Factor (SJIF): 4.72 International Journal of Advance Engineering and Research Development Volume 4, Issue 4, April -2017 Duplicate File Searcher and Remover Remover Of Duplicate

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

HP Data Protector 9.0 Deduplication

HP Data Protector 9.0 Deduplication Technical white paper HP Data Protector 9.0 Deduplication Introducing Backup to Disk devices and deduplication Table of contents Summary 3 Overview 3 When to use deduplication 4 Advantages of B2D devices

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part

More information

Cisco Tetration Analytics Platform: A Dive into Blazing Fast Deep Storage

Cisco Tetration Analytics Platform: A Dive into Blazing Fast Deep Storage White Paper Cisco Tetration Analytics Platform: A Dive into Blazing Fast Deep Storage What You Will Learn A Cisco Tetration Analytics appliance bundles computing, networking, and storage resources in one

More information

MAPR TECHNOLOGIES, INC. TECHNICAL BRIEF APRIL 2017 MAPR SNAPSHOTS

MAPR TECHNOLOGIES, INC. TECHNICAL BRIEF APRIL 2017 MAPR SNAPSHOTS MAPR TECHNOLOGIES, INC. TECHNICAL BRIEF APRIL 2017 MAPR SNAPSHOTS INTRODUCTION The ability to create and manage snapshots is an essential feature expected from enterprise-grade storage systems. This capability

More information