The Design of Distributed File System Based on HDFS Yannan Wang 1, a, Shudong Zhang 2, b, Hui Liu 3, c


Applied Mechanics and Materials, Online: 2013-09-27, ISSN: 1662-7482, Vols. 423-426, pp. 2733-2736. doi:10.4028/www.scientific.net/AMM.423-426.2733. © 2013 Trans Tech Publications, Switzerland

The Design of Distributed File System Based on HDFS

Yannan Wang 1, a, Shudong Zhang 2, b, Hui Liu 3, c
1, 2, 3 College of Information Engineering, Capital Normal University, Beijing, China
a wangyannanme@163.com, b zsd@mail.cnu.edu.cn, c liuhui_cnu@yahoo.com.cn

Keywords: HDFS, Small File, Binary Serialization, SequenceFile

Abstract. HDFS is a distributed file system designed for access to large files, and it is inefficient at storing small files. To address this issue, this paper designs a new storage architecture on top of HDFS that mitigates HDFS's low efficiency in storing small files. The approach uses SequenceFile to merge small files and, to overcome the shortcomings of a plain SequenceFile merge, designs a new system structure based on HDFS. The system adds a file-judgment unit to identify and mark small files, creates a local index file recording the size and offset of each small file (which improves small-file retrieval efficiency), and finally uses binary serialization to merge the small files, so that small files are written into large files in time order.

Introduction

Academia and industry have not yet settled on a uniform definition of cloud computing. To a certain extent, cloud computing can be regarded as the commercial development of computing concepts such as distributed computing, parallel computing, and grid computing; its basic principle is that people use the resources of a computer cluster via the Internet [1].
Hadoop is an open-source distributed computing framework from the Apache open-source organization. It focuses on distributed storage and processing of massive data, provides the MapReduce technology framework implemented in Java, and can deploy distributed applications onto low-cost servers [2]. Hadoop handles massive large files very well, but as the number of small files grows, Hadoop starts to become powerless: storing each small file requires repeatedly requesting metadata memory and allocating blocks, so a large number of small files overwhelms the single NameNode, and their metadata occupies a large share of the NameNode's memory [3]. To address these problems, this paper designs a distributed file system based on HDFS that improves HDFS's low efficiency in processing small files.

HDFS Architecture Analysis

The HDFS architecture is based on a cluster of a large number of ordinary computers. Nodes in the cluster usually run a GNU/Linux operating system that must support Java, because HDFS is implemented in Java. HDFS uses a master-slave (Master/Slave) architecture: a cluster has one Master and multiple Slaves, where the former is called the name node (NameNode) and the latter are called data nodes (DataNodes), as shown in Figure 1. In theory, a single computer can run multiple DataNode processes and one NameNode process (which is unique throughout the cluster), but in practice a computer usually runs either one DataNode or the NameNode [4]. A file is divided into a number of blocks stored on a set of DataNodes.

Figure 1. HDFS structure
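The block mechanism described above can be illustrated with a minimal sketch (not the HDFS implementation): a file's bytes are cut into fixed-size blocks, and each block can then be placed on a different DataNode. The 8-byte block size here is a toy value chosen for illustration; real HDFS blocks are tens of megabytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration of HDFS-style block splitting: a file is divided
// into fixed-size blocks, with the last block holding the remainder.
public class BlockSplitter {
    static List<byte[]> split(byte[] file, int blockSize) {
        List<byte[]> blocks = new ArrayList<>();
        for (int off = 0; off < file.length; off += blockSize) {
            int end = Math.min(off + blockSize, file.length);
            blocks.add(Arrays.copyOfRange(file, off, end));
        }
        return blocks;
    }

    public static void main(String[] args) {
        // A 20-byte "file" with an 8-byte block size yields blocks of 8, 8, and 4 bytes.
        List<byte[]> blocks = split(new byte[20], 8);
        System.out.println(blocks.size());        // 3
        System.out.println(blocks.get(2).length); // 4
    }
}
```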

Problems When HDFS Stores and Processes Small Files

HDFS is designed for large files and shows its performance advantages when storing them, but it has no good way to optimize small files. Every block, file, or directory in HDFS is stored as an object in the NameNode's memory, and each object takes about 150 bytes. With ten million small files (each kept in duplicate), the NameNode needs about 2 GB of memory, and if the number of small files grows to 100 million, the NameNode needs about 20 GB. Small files therefore consume a great deal of NameNode memory, and the NameNode's memory capacity severely constrains cluster expansion and its applications. Secondly, accessing a large number of small files is much slower than accessing a few large files of the same total size: HDFS was originally developed for streaming access to large files, and reading many small files requires constantly jumping from one DataNode to another, which seriously hurts performance. Finally, it is much faster to process large files than to process a large number of small files of the same total size, because each small file occupies a task slot, and task startup and release consume a great deal, even most, of the total time [5].

Related Research

At present there are three main techniques for processing small files [6].

HAR Archive Technology [7]. Hadoop Archives (HAR files) is an archiving file system that Hadoop provides, designed to reduce the NameNode memory consumed by large numbers of small files. A HAR file is a special file format, created by the Hadoop archive command, which runs a MapReduce task to pack a number of small files into one HAR file. A HAR file cannot be changed once created; to add or delete a file, the client must re-create the archive.

SequenceFile Technology.
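The memory pressure can be checked with back-of-the-envelope arithmetic. This sketch uses the ~150-byte-per-object figure quoted above; the objects-per-file count is an assumption (one file object plus one block object), and the exact totals depend on how many objects each file actually contributes.

```java
// Rough NameNode heap estimate for small-file metadata,
// assuming ~150 bytes per in-memory object as cited in the text.
public class NameNodeMemoryEstimate {
    static long estimateBytes(long fileCount, int objectsPerFile) {
        final int BYTES_PER_OBJECT = 150; // approximate figure from the text
        return fileCount * objectsPerFile * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // Ten million small files, assuming one file object and one block object each:
        long bytes = estimateBytes(10_000_000L, 2);
        System.out.println(bytes / (1024 * 1024)); // result in megabytes, on the order of a few GB
    }
}
```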
A SequenceFile is a flat file consisting of binary key/value pairs, and it can serve as an input/output format for MapReduce [8]. A SequenceFile can use the file name as the key and the file content as the value: a program can write many small files into a single SequenceFile and then use that file directly. However, SequenceFile does not establish a mapping from the small files to the large file, so without an index, querying a small file requires traversing the entire SequenceFile, which reduces read efficiency.

CombineFileInputFormat. The reason Hadoop is unsuited to processing a large number of small files is that FileInputFormat always generates an InputSplit from the whole or part of a single input file. When dealing with many small files, each map operation handles only a small amount of input data, resulting in too many map tasks and reduced overall performance. CombineFileInputFormat is a newer InputFormat that alleviates this problem: it merges multiple files into a single split, and it can take the storage location of the data into account [9].

Design of the Storage Structure

All three methods described above have problems, and they still require archiving the small files in HDFS to reduce their number, which brings considerable inconvenience. This paper therefore adds a judgment module on top of the original HDFS; the structure is shown in Figure 2. When a file arrives, the system first determines whether it is a small file: if it is, it is handed to the merge-small-files unit; if not, it is uploaded directly to HDFS. Each part is briefly introduced below.
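The core idea of the design above can be sketched in plain Java (the names here are hypothetical, and a real implementation would write through the HDFS client rather than to an in-memory buffer): small files are appended to one large file in arrival (time) order, while a local index records each file's offset and size so a read can seek directly instead of scanning the whole merged file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the paper's merge scheme: binary-serialize small files
// into one large file in time order, keeping a local {offset, size} index.
public class SmallFileMerger {
    final ByteArrayOutputStream merged = new ByteArrayOutputStream();
    final Map<String, long[]> index = new LinkedHashMap<>(); // name -> {offset, size}

    void append(String name, byte[] content) throws IOException {
        index.put(name, new long[]{merged.size(), content.length});
        merged.write(content); // small file lands at the current end of the large file
    }

    byte[] read(String name) {
        long[] entry = index.get(name);       // index lookup avoids a full scan
        byte[] all = merged.toByteArray();
        byte[] out = new byte[(int) entry[1]];
        System.arraycopy(all, (int) entry[0], out, 0, out.length);
        return out;
    }

    public static void main(String[] args) throws IOException {
        SmallFileMerger m = new SmallFileMerger();
        m.append("a.txt", "hello".getBytes(StandardCharsets.UTF_8));
        m.append("b.txt", "world!".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(m.read("b.txt"), StandardCharsets.UTF_8)); // world!
        System.out.println(m.index.get("b.txt")[0]); // 5 (b.txt starts after the 5 bytes of a.txt)
    }
}
```

The index is what distinguishes this scheme from a plain SequenceFile merge: retrieval cost no longer grows with the size of the merged file.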

Figure 2. The structure of the data storage system based on HDFS

Determine-the-File-Type Unit. Users can easily upload, view, and download data and complete other related operations; the design takes the needs of non-professional users into account, exposing only simple business operations and the final valid data. This unit performs the judgment of the file: to decide whether an uploaded file is a small file, the paper sets a specific threshold. The system uses 1 MB, so a file smaller than 1 MB is a small file and the rest are large files. If a file is judged large, the unit hands it directly to the HDFS client; if it is judged small, the file is passed to the merge-small-files unit, which first creates an index entry recording the size and offset of the small file.

Merge-Small-Files Unit. The main function of the merge-small-files unit is to merge small files into large files, in order to reduce the Map resource waste caused by large numbers of small files. In this unit, to read small files more effectively and to fix the low retrieval efficiency of merging with SequenceFile alone, a local index file is created to store the size and offset of the current file. At the same time, to ease the storage of small files, this paper uses a binary serialization scheme to merge small files and processes them in time order.

Storage Section. The storage section is composed of a large number of low-cost servers; it is a collection of multiple devices. The entire storage layer consists of one NameNode and multiple DataNodes, which together carry out the storage operations of the whole system. The NameNode is responsible for managing the namespace of the cluster file system.
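The judgment step above amounts to a single threshold comparison; a minimal sketch, with hypothetical names, looks like this:

```java
// Sketch of the determine-the-file-type unit: files under the 1 MB
// threshold are routed to the merge unit, larger files go directly
// to the HDFS client.
public class FileJudge {
    static final long THRESHOLD = 1024 * 1024; // 1 MB, the threshold set by the system

    static String route(long fileSizeBytes) {
        return fileSizeBytes < THRESHOLD ? "merge-unit" : "hdfs-client";
    }

    public static void main(String[] args) {
        System.out.println(route(200 * 1024));      // merge-unit
        System.out.println(route(5 * 1024 * 1024)); // hdfs-client
    }
}
```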
The DataNodes are mainly responsible for storing data blocks on their nodes, reporting status to the NameNode, and performing the pipeline operations of data replication.

System Flowchart. The specific workflow is shown in Figure 3.

Figure 3. System flowchart

Conclusions

This paper analyzes the architecture of HDFS and its deficiencies in dealing with small files, and for these shortcomings it improves on the design of the HDFS-based distributed file system, producing a new distributed file system based on HDFS that improves processing performance for small files. An uploaded file is first passed to the determine-the-file-type unit: if the file is large, it is handed directly to HDFS; if it is small, it is passed to the merge-small-files unit, where an index entry recording the size and offset of the current small file is created, and after a certain period of time SequenceFile merging starts, reducing the number of small files and the memory usage of the NameNode.

Acknowledgment

This research was supported by the China National Key Technology R&D Program (2012BAH20B03, 2013BAH19F01, 2012BAZ03836), the National Nature Science Foundation (31101078), the Beijing Nature Science Foundation (4122016), the "Computer Application Technology" Beijing municipal key discipline construction project, the Beijing Engineering Research Center, and the Beijing Educational Committee science and technology development plan project (KM201110028018).

References

[1] Jianguang Deng, Xiaoheng Pan, Huaqiang Yuan, Research of Cloud Storage and its Distributed File System, Journal of Dongguan University of Technology, vol. 19, no. 7, pp. 41-45, 2012.
[2] Weijiao Hao, Shijian Zhou, Dawei Peng, Research of the Cloud GIS Frame with Hadoop Cloud Platform, Jiangxi Science, vol. 31, no. 1, pp. 109-112, 2013.
[3] Dongxue Qin, Study on Processing of Massive Small Files Based on Hadoop, Liaoning University, China, 2011.
[4] Chunling Xu, Guangquan Zhang, Comparison and Analysis of the Distributed File System Hadoop HDFS with the Traditional File System Linux FS, Journal of Suzhou University, vol. 30, no. 4, pp. 5-9, 2012.
[5] Guangyao Zhu, The Hadoop Mass Processing and Analysis of Small Files, Science and Technology Information, no. 28, 2012.
[6] Yannan Wang, Hui Liu, Shudong Zhang, Research of Processing Massive Small Files Based on Hadoop, Journal of Convergence Information Technology, vol. 8, no. 9, pp. 130-137, 2013.
[7] http://heipark.iteye.com/blog/1356063
[8] http://blog.csdn.net/flyingpig4/article/details/7579658
[9] Xusheng Hong, Shiping Lin, Efficiency of Storing Small Files in HDFS Based on MapFile, Computer Systems & Applications, vol. 21, no. 11, pp. 179-182, 2013.
