Applied Mechanics and Materials, Vols. 423-426, pp. 2733-2736. Online: 2013-09-27. ISSN: 1662-7482. doi:10.4028/www.scientific.net/amm.423-426.2733. © 2013 Trans Tech Publications, Switzerland.

The Design of Distributed File System Based on HDFS

Yannan Wang 1, a, Shudong Zhang 2, b, Hui Liu 3, c
1, 2, 3 College of Information Engineering, Capital Normal University, Beijing, China
a wangyannanme@163.com, b zsd@mail.cnu.edu.cn, c liuhui_cnu@yahoo.com.cn

Keywords: HDFS, small file, binary serialization, SequenceFile

Abstract. HDFS is a distributed file system designed for access to large files, and it is inefficient at storing small files. To address this issue, this article designs a new storage architecture based on HDFS that solves the problem of HDFS's low efficiency in storing small files. The paper mainly uses SequenceFile to merge small files and, to remedy the shortcomings of merging small files with SequenceFile alone, provides a solution and designs a new system structure on top of HDFS. The system adds a file judgment unit to identify and mark small files, creates a local index file that records the size and offset of each small file (which improves small-file retrieval efficiency), and finally uses binary serialization to merge the small files, so that small files are written into large files in time order.

Introduction

Cloud computing has not yet been given a uniform definition by academia and industry. To a certain extent, cloud computing can be regarded as the commercial development of computing concepts including distributed computing, parallel computing, and grid computing, whose basic principle is that people use the resources of a computer cluster via the Internet [1].
Hadoop is an open-source distributed computing framework from the Apache open-source organization. It focuses on the distributed storage and processing of massive data, provides a MapReduce framework implemented in Java, and can deploy distributed applications on low-cost servers [2]. Hadoop handles massive large files very well, but as the number of small files grows, Hadoop begins to struggle, because storing each small file requires repeatedly requesting memory and allocating a block. A large number of small files overwhelms the single NameNode, as a great deal of metadata occupies the NameNode's memory [3]. To address the above problems, this paper designs a distributed file system based on HDFS that solves HDFS's low efficiency in processing small files.

HDFS Architecture Analysis

The HDFS architecture is based on a cluster of many ordinary computers. Nodes in the cluster usually run the GNU/Linux operating system and must support Java, because HDFS is implemented in Java. HDFS uses a master-slave (Master/Slave) architecture: a cluster has one Master and multiple Slaves; the former is called the name node (NameNode) and the latter are called data nodes (DataNodes), as shown in Figure 1. In theory, a single computer can run multiple DataNode processes and one NameNode process (which is unique throughout the cluster), but in practice a computer usually runs either one DataNode or one NameNode [4]. A file is divided into a number of blocks stored on a set of DataNodes.

Figure 1. HDFS structure
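The metadata/data split described above can be illustrated with a minimal sketch in Python (the tiny block size, node names, and round-robin placement are illustrative assumptions only; real HDFS uses 64/128 MB blocks and a rack-aware, replicated placement policy):

```python
# Minimal illustration of HDFS-style metadata: the NameNode maps each
# file to a list of blocks, and each block to a DataNode holding it.
BLOCK_SIZE = 4  # bytes, tiny for illustration

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split file content into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, datanodes):
    """Round-robin placement sketch: block index -> DataNode name."""
    return {i: datanodes[i % len(datanodes)] for i in range(len(blocks))}

blocks = split_into_blocks(b"hello world!")      # 12 bytes -> 3 blocks
placement = place_blocks(blocks, ["dn1", "dn2"])
```

Note that the per-file and per-block bookkeeping lives entirely on the NameNode side in this picture, which is exactly why many small files inflate NameNode memory.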
Applied Materials and Technologies for Modern Manufacturing

Problems When HDFS Stores and Processes Small Files

HDFS is designed for large files and shows its performance advantages when storing them, but it has no good way to optimize small files. Any block, file, or directory in HDFS is stored as an object in memory, and each object takes about 150 bytes of memory. With ten million small files, the NameNode needs about 2 GB of space, and if the number of small files increases to 100 million, the NameNode needs about 20 GB. Small files consume a great deal of NameNode memory, so the NameNode's memory capacity severely constrains cluster expansion and its applications. Secondly, accessing a large number of small files is much slower than accessing a few large files. HDFS was originally developed for streaming access to large files, and when a large number of small files is accessed, reads must constantly jump from one DataNode to another, which seriously affects performance. Finally, processing a large file is much faster than processing many small files of the same total size: each small file occupies a map slot, and task startup and release consume much, and often most, of the job's time [5].

Related Researches

At present, there are three technologies for processing small files [6].

HAR Archive Technology [7]. Hadoop Archives (HAR files) is an archival file system that Hadoop provides, designed to reduce the NameNode memory that a large number of small files consumes. A HAR file is a special file format, created by the Hadoop archive command, which runs a MapReduce task to pack a number of small files into one HAR file. A HAR file cannot be changed once created; to add or delete a file, the client must re-create the archive.

SequenceFile Technology.
A SequenceFile is a flat file consisting of a byte stream of binary key/value records and can be used as the input/output format of MapReduce [8]. A SequenceFile can use the file name as the key and the file content as the value. One can write a program that writes many small files into a single SequenceFile and then use that file directly. But a SequenceFile does not establish a mapping from the original files to the merged large file, and without an index, querying a small file requires traversing the entire SequenceFile, which reduces read efficiency.

CombineFileInputFormat. The reason Hadoop is not suited to processing a large number of small files is that the InputSplits generated by FileInputFormat always come from the whole or part of a single input file. When dealing with many small files, each map operation handles only a small amount of input data, resulting in too many map tasks and reduced overall performance. CombineFileInputFormat is a newer InputFormat that alleviates this problem: it merges multiple files into a single split, and it can take the storage location of the data into account [9].

Design of Storage Structure

The three methods described above for handling small files each have problems, and they still require archiving small files in HDFS to reduce their number, which brings considerable inconvenience. This paper therefore adds a judgment module on top of the original HDFS. The structure is shown in Figure 2. When a file arrives, the system first determines whether it is a small file; if it is, the file is handed to the Merge Small Files Unit, and if not, it is uploaded directly to HDFS. The following is a brief introduction of each part.
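The traversal cost of an unindexed SequenceFile can be seen in a small sketch. The following Python code is a simplified stand-in for Hadoop's Java SequenceFile API (the real format also has a header, sync markers, and compression options, which are omitted here); records are stored as length-prefixed key/value pairs, and a lookup must scan from the beginning:

```python
import io
import struct

def write_seq(records):
    """Write (name, content) pairs as length-prefixed key/value records,
    mimicking the key/value layout of a SequenceFile."""
    buf = io.BytesIO()
    for name, content in records:
        key = name.encode("utf-8")
        buf.write(struct.pack(">I", len(key)))   # key length
        buf.write(key)                           # key = file name
        buf.write(struct.pack(">I", len(content)))  # value length
        buf.write(content)                       # value = file content
    return buf.getvalue()

def lookup_by_scan(data, name):
    """Without an index, finding one small file means walking the records
    from the start -- the inefficiency noted above."""
    pos, target = 0, name.encode("utf-8")
    while pos < len(data):
        (klen,) = struct.unpack_from(">I", data, pos)
        pos += 4
        key = data[pos:pos + klen]
        pos += klen
        (vlen,) = struct.unpack_from(">I", data, pos)
        pos += 4
        value = data[pos:pos + vlen]
        pos += vlen
        if key == target:
            return value
    return None

data = write_seq([("a.txt", b"AAA"), ("b.txt", b"BB")])
```

In the worst case every record is decoded before the target is found, which is what the local index file introduced later is meant to avoid.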
Figure 2. The structure of the data storage system based on HDFS

Determine the File Unit. The user can easily upload, view, and download data and complete other related operations; the design takes into account the needs of non-professional users, to whom it exposes only simple business operations and the final valid data. The Determine-the-file-type component performs the judgment of the file. To decide whether an uploaded file is a small file, the paper sets a specific threshold: the system uses a 1 MB threshold, and a file smaller than 1 MB is a small file, while the others are large files. When a file is judged to be large, the component hands it directly to the HDFS client; if it is judged to be small, the file is passed to the Merge Small Files Unit, which first creates an index entry recording the size and offset of the small file.

Merge Small Files Unit. The main function of the Merge Small Files Unit is to merge small files into large files, in order to reduce the waste of map resources caused by a large number of small files. In this unit, to read small files more effectively and to resolve the low retrieval efficiency of merging small files with a plain SequenceFile, a local index file is created to store the size and offset of the current file. At the same time, to facilitate the storage of small files, this paper uses a binary serialization scheme to merge small files and processes them in time order.

Storage Section. The storage section is composed of a large number of low-cost servers, a collection of multiple devices. The entire storage layer consists of one NameNode and multiple DataNodes, which together complete the storage operations of the whole system. The NameNode is responsible for managing the namespace of the cluster file system.
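The index-plus-merge scheme above can be sketched as follows (a minimal Python illustration of the idea, not the paper's actual implementation; the in-memory dict stands in for the local index file):

```python
def merge_with_index(small_files):
    """Merge small files (in arrival/time order) into one byte stream and
    record each file's (offset, size) in a local index, as the Merge
    Small Files Unit does."""
    merged, index = bytearray(), {}
    for name, content in small_files:   # time order is preserved
        index[name] = (len(merged), len(content))
        merged.extend(content)
    return bytes(merged), index

def read_small_file(merged, index, name):
    """Direct seek via the index -- no traversal of the merged file."""
    offset, size = index[name]
    return merged[offset:offset + size]

merged, index = merge_with_index([("a.txt", b"xxx"), ("b.txt", b"yyyy")])
```

Because the index maps a file name straight to its offset and size, retrieval is a single seek and read, which is the efficiency gain over an unindexed SequenceFile.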
The DataNode is mainly responsible for storing data blocks on its node, reporting status to the NameNode, and performing pipelined replication of data blocks.

System Flowchart. The specific workflow is shown in Figure 3.

Figure 3. System flowchart
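The routing step of the workflow, where the file judgment unit applies the 1 MB threshold, can be sketched like this (the dict and list stand in for the HDFS client and the merge unit's input queue; both names are illustrative):

```python
SMALL_FILE_THRESHOLD = 1 * 1024 * 1024  # 1 MB, the threshold the system sets

def route_file(name, content, hdfs, merge_queue):
    """File judgment unit: large files go straight to HDFS; small files
    are queued for the Merge Small Files Unit, which indexes and merges
    them in time order."""
    if len(content) < SMALL_FILE_THRESHOLD:
        merge_queue.append((name, content))
    else:
        hdfs[name] = content

hdfs, merge_queue = {}, []
route_file("big.bin", b"x" * (2 * 1024 * 1024), hdfs, merge_queue)
route_file("small.txt", b"hello", hdfs, merge_queue)
```

After a batch accumulates in the queue, the merge unit serializes it into one large file, as described above.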
Conclusions

This paper analyzes the architecture of HDFS and its deficiencies in dealing with small files, and to remedy these shortcomings it improves on the HDFS distributed file system and designs a new distributed file system based on HDFS that improves processing performance for small files. First, an uploaded file is passed to the Determine-the-file-type component; if the file is large, it is given directly to HDFS, and if it is small, it is passed to the Merge Small Files Unit. An index entry recording the size and offset of the current small file is then created, and after a certain period of time SequenceFile-based merging starts, reducing the number of small files and the memory usage of the NameNode.

Acknowledgment

This research was supported by the China National Key Technology R&D Program (2012BAH20B03, 2013BAH19F01, 2012BAZ03836), the National Nature Science Foundation (31101078), the Beijing Nature Science Foundation (4122016), the Beijing municipal key discipline "Computer Application Technology", the Beijing Engineering Research Center, and the Beijing Educational Committee science and technology development plan project (KM201110028018).

References

[1] Jianguang Deng, Xiaoheng Pan, Huaqiang Yuan, Research of Cloud Storage and its Distributed File System, Journal of Dongguan University of Technology, vol. 19, no. 7, pp. 41-45, 2012.
[2] Weijiao Hao, Shijian Zhou, Dawei Peng, Research of the Cloud GIS Frame with Hadoop Cloud Platform, Jiangxi Science, vol. 31, no. 1, pp. 109-112, 2013.
[3] Dongxue Qin, Study on Processing of Massive Small Files Based on Hadoop, Liaoning University, China, 2011.
[4] Chunling Xu, Guangquan Zhang, Comparison and Analysis of Distributed File System Hadoop HDFS with Traditional File System Linux FS, Journal of Suzhou University, vol. 30, no. 4, pp. 5-9, 2012.
[5] Guangyao Zhu, The Hadoop Mass Processing and Analysis of Small Files, Science and Technology Information, no. 28, 2012.
[6] Yannan Wang, Hui Liu, Shudong Zhang, Research of Processing Massive Small Files Based on Hadoop, Journal of Convergence Information Technology, vol. 8, no. 9, pp. 130-137, 2013.
[7] http://heipark.iteye.com/blog/1356063
[8] http://blog.csdn.net/flyingpig4/article/details/7579658
[9] Xusheng Hong, Shiping Lin, Efficiency of Storaging Small Files in HDFS Based on MapFile, Computer Systems & Applications, vol. 21, no. 11, pp. 179-182, 2013.