Implementation and performance test of cloud platform based on Hadoop


IOP Conference Series: Earth and Environmental Science, OPEN ACCESS paper. To cite this article: Jingxian Xu et al 2018 IOP Conf. Ser.: Earth Environ. Sci. 108 052034.

Implementation and performance test of cloud platform based on Hadoop

Jingxian Xu, Jianhong Guo* and Chunlan Ren

TSL School of Business and Information Technology, Quanzhou Normal University, Quanzhou, China

*Corresponding author e-mail: yjsxjx@126.com

Abstract. Hadoop, an open source project of the Apache Foundation, is a distributed computing framework for processing large amounts of data, and it has been widely used in the Internet industry. It is therefore meaningful to study both the implementation of a Hadoop platform and methods for testing the platform's performance. This paper presents a method for building a Hadoop platform and a method for testing the platform's performance. Experimental results show that the proposed performance testing method is effective and can measure the performance of the Hadoop platform.

1. Introduction
At present big data has become a focus of research, and many solutions for big data exist. Hadoop, an open source project of the Apache Foundation, has accumulated a large number of users and is widely recognized in industry. Many well-known enterprises, including Alibaba and Tencent, have applied Hadoop in their own business domains. As an implementation of cloud computing technology, Hadoop lets users implement their application logic on top of the Hadoop framework. One of the goals of cloud computing is to provide highly available computing resources at the lowest possible cost. Hadoop can handle thousands of computer nodes and huge amounts of data, and it handles job scheduling and load balancing automatically, so it is a very suitable tool for cloud computing. Hadoop also has the advantages of low cost and high security, so studying the implementation of a Hadoop platform and testing the platform's performance is very meaningful. This paper first studies the implementation of a Hadoop platform on virtual machines, then runs benchmark tests on the Hadoop platform, and finally analyzes the test results. The cloud platform is built in a virtualized environment; this approach saves money while still allowing the performance of the platform to be tested.

2. Implementation of the Hadoop Cloud Platform
This section introduces the method of installing and configuring Hadoop in a Linux operating system environment. The installed version of the Linux operating system is CentOS 6.5, the installed version of Java is 1.7, and the installed version of Hadoop is 2.7. The main steps of installation and configuration are described below.

2.1. Server node planning
The cluster consists of 4 nodes: 1 Master and 3 Slaves. First, the virtualization software VMware is installed on the host, and then the 4 computer nodes are created by cloning in VMware: 1 Master node and 3 Slave nodes. You can install the Master node first, then clone the 3 Slave nodes with VMware, and configure the 4 nodes so that the virtual nodes can communicate with each other. All four nodes run CentOS 6.5 and have the same user, hadoop. The Master node is responsible for managing distributed data and decomposing tasks; the 3 Slave nodes are responsible for storing distributed data and performing tasks. The server node planning is shown in Table 1.

Table 1. Server node planning

Host name  Running main processes                                                 Assigned IP  Installed software
Master     NameNode, DataNode, DFSZKFailoverController, NodeManager, JournalNode  192.3.30.1   JDK, Hadoop, ZooKeeper
Slave1     DataNode, NodeManager, JournalNode, QuorumPeerMain, ResourceManager    192.3.40.1   JDK, Hadoop, ZooKeeper
Slave2     DataNode, NodeManager, JournalNode, QuorumPeerMain, ResourceManager    192.3.40.2   JDK, Hadoop, ZooKeeper
Slave3     DataNode, NodeManager, JournalNode, QuorumPeerMain, ResourceManager    192.3.40.3   JDK, Hadoop, ZooKeeper

2.2. Configure host
To build the Hadoop cluster successfully and allow information transfer among the nodes of the cluster, the hostname must be modified and the network environment configured first. Modify the hostname of the Master machine to master with HOSTNAME=master; the Slave nodes are set by analogy. To configure the network environment of the Hadoop cluster, the IP addresses and hostnames of all machines must be added to the /etc/hosts file. In this way, Master and all Slave machines can communicate not only by IP but also by hostname. Therefore, the following entries are appended to the end of the /etc/hosts file on all machines: 192.3.30.1 master, 192.3.40.1 slave1, 192.3.40.2 slave2, 192.3.40.3 slave3.
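As a concrete sketch of this step on the Master node (assuming the IP plan of Table 1, and that the hostname is persisted in /etc/sysconfig/network as is usual on CentOS 6.x):

```bash
# Set the hostname for the current session and persist it (CentOS 6.x)
hostname master
sed -i 's/^HOSTNAME=.*/HOSTNAME=master/' /etc/sysconfig/network

# Append the IP/hostname mapping of all four nodes to /etc/hosts
cat >> /etc/hosts <<'EOF'
192.3.30.1 master
192.3.40.1 slave1
192.3.40.2 slave2
192.3.40.3 slave3
EOF
```

The same /etc/hosts entries are appended on every Slave node, with the HOSTNAME value adjusted to each node's own name.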
2.3. SSH (Secure Shell) passwordless login configuration
To build the Hadoop cluster successfully, the nodes must be able to log in to each other over SSH without password authentication. The configuration steps are:
(1) Generate the two key files id_rsa and id_rsa.pub on the Master machine; they are stored in the /home/hadoop/.ssh directory.
(2) Append the content of the public key file id_rsa.pub to the end of the file authorized_keys.
(3) Modify the permissions of authorized_keys.
(4) Copy the authorized_keys file from the Master to the Slave1 node.
(5) Log in to the 3 other Slave nodes from the Master node via SSH; if no password authentication is required, the passwordless login of the Master node has been configured successfully.
(6) Configure passwordless SSH on the Slave nodes in the same way as on the Master.
After the configuration succeeds, the Master node and the 3 Slave nodes can log in to each other without entering a password.
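A minimal sketch of these steps, run as the hadoop user on the Master node (the loop generalizes step (4) to all three Slave nodes):

```bash
# (1) Generate an RSA key pair with an empty passphrase
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

# (2) Append the public key to authorized_keys
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# (3) Restrict permissions, otherwise sshd refuses the key file
chmod 600 ~/.ssh/authorized_keys

# (4) Distribute authorized_keys to the Slave nodes
for node in slave1 slave2 slave3; do
  scp ~/.ssh/authorized_keys hadoop@"$node":~/.ssh/
done

# (5) Verify: this should open a session without asking for a password
ssh slave1 exit
```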
2.4. Java environment configuration
The JDK has to be installed on all nodes: first on the Master node, and then on the other Slave nodes. To install the JDK and configure the environment variables, you need to log in as root. At the end of the /etc/profile file, three Java-related variables are added: JAVA_HOME, CLASSPATH and PATH.
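A sketch of the /etc/profile additions follows; the JDK installation path below is an assumption, so adjust it to the directory where JDK 1.7 was actually installed:

```bash
# Appended to /etc/profile on every node (JDK path is an example)
export JAVA_HOME=/usr/java/jdk1.7.0
export CLASSPATH=.:$JAVA_HOME/lib
export PATH=$JAVA_HOME/bin:$PATH
```

After editing, `source /etc/profile` makes the variables take effect in the current shell, and `java -version` verifies the installation.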
2.5. Hadoop cluster configuration
Hadoop must be installed on all nodes: first on the Master node, and then on the other Slave nodes. To install and configure Hadoop, you need to log in as root. To configure Hadoop, upload the hadoop-2.7.tar.gz file to the /usr folder of the Master machine and extract it. Seven configuration files are then modified according to the actual situation of the system, among them hadoop-env.sh, yarn-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml. Of these files, the focus is on configuring the hdfs-site.xml file, which is primarily responsible for HDFS-related settings. The specific configuration used in the cloud platform is shown in Table 2.

Table 2. hdfs-site.xml configuration information

Parameter                    Value
dfs.name.dir                 /home/hadoop/hadoop-2.7/dfs/name
dfs.replication              3
dfs.blocksize                131072
dfs.namenode.handler.count   20
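As a sketch, the settings of Table 2 can be written into hdfs-site.xml as follows. The path /usr/hadoop-2.7/etc/hadoop is an assumption based on the upload location above, and a production file would usually carry further properties; only the values from Table 2 are shown:

```bash
# Minimal hdfs-site.xml holding only the Table 2 values
# (assumed extraction path; adjust to the actual Hadoop directory)
cat > /usr/hadoop-2.7/etc/hadoop/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/hadoop-2.7/dfs/name</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>131072</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>20</value>
  </property>
</configuration>
EOF
```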
3. Benchmark tests on the Hadoop platform
Testing can verify the correctness of the cloud platform and reveal its performance. Testing is therefore very important, yet it is often overlooked. In order to gain a more comprehensive understanding of the cloud platform, find its bottlenecks and better improve its performance, 4 benchmark tests are performed on the Hadoop platform, using the testing tools that ship with Hadoop.

3.1. mrbench test
To measure the efficiency of small-job execution, mrbench runs a small job many times. This experiment uses the mrbench program to test small jobs; the results are shown in Table 3.

Table 3. mrbench test results

Number of runs       10    20     50     200    500
Execution time (s)   15.6  15.49  15.43  15.38  15.36

As the test results in Figure 1 show, the average job execution time decreases slowly and steadily as the number of small jobs increases. The last two measurements differ very little: although the number of runs more than doubled (from 200 to 500), the execution time was not reduced correspondingly. This indicates that at 500 runs the small jobs have reached the limit of this cluster.

Figure 1. mrbench test results.
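For reference, mrbench is shipped in Hadoop's job client test jar; a typical invocation (the exact jar file name depends on the 2.7.x release) looks like this:

```bash
# Run the small-job benchmark 50 times and report the average runtime
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  mrbench -numRuns 50
```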
3.2. WordCount test
The main function of the WordCount program is to count the number of occurrences of each word in a text file. The WordCount program splits the text file into small pieces according to certain rules and feeds them to the Map tasks. Each Map task simply emits the frequency of the different words it sees, and the Shuffle and Reduce stages then work together to complete the word count. The CPU requirements of the task are therefore very small, which the experimental results also confirm. This experiment uses the WordCount program to process files of 100M, 300M, 500M, 1G and 2G respectively; the results are shown in Table 4.

Table 4. WordCount test results

Size of input file   100M  300M  500M  1G   2G
Execution time (s)   36    74    137   368  852

Figure 2. WordCount test results.

As Figure 2 shows, the execution time of the WordCount program increases with the data size. While the file is relatively small, the execution time grows slowly; once the file size reaches 1G or more, the execution time grows faster. In general, each doubling of the file size more than doubles the execution time.
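For reference, WordCount is part of Hadoop's examples jar; a typical invocation (the HDFS input and output paths here are placeholders) is:

```bash
# Count word frequencies in /input and write the results to /output on HDFS
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount /input /output
```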

3.3. TestDFSIO test
Hadoop ships with a number of benchmark programs that are packaged in its test jar. TestDFSIO is one of them; it tests the I/O performance of HDFS. Most failures of a new cloud platform system involve the hard disks, and running I/O-intensive tests reveals the performance of the cluster's disks. TestDFSIO uses a MapReduce job to carry out the test, which is a convenient way to read and write files in parallel: each file is read or written in a separate Map task, the Map tasks also collect statistics about the files, and the statistics are accumulated in the Reduce task. In the following experiments the total amount of data is the same but the number of files differs; the results are shown in Tables 5 and 6.

Table 5. Write test with 10G total data and different file counts

Single file size / number of files   10G/1   5G/2    2.5G/4  1G/10   400M/25
Write throughput (MB/s)              30.17   14.68   6.89    6.13    6.46
Write execution time (s)             401.78  452.33  519.89  549.92  555.25

Table 6. Read test with 10G total data and different file counts

Single file size / number of files   10G/1   5G/2    2.5G/4  1G/10   400M/25
Read throughput (MB/s)               52.56   24.29   4.28    3.93    3.35
Read execution time (s)              254.61  296.56  705.36  756.28  802.39

Figure 3. TestDFSIO test results for different file counts.

The results in Figure 3 show that when the number of files increases from 2 to 4, the execution time rises quickly, while beyond 4 files the execution time no longer increases significantly. This indicates that the execution time grows markedly as the number of processed files increases, and that the cloud platform performs faster when the Reducer parameter is close to the number of cluster nodes.
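For reference, TestDFSIO lives in the same job client test jar as mrbench; a sketch of a write and read run comparable to the 1G/10 column (flag spellings as in Hadoop 2.x, plain file sizes given in MB) is:

```bash
TESTS_JAR=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar

# Write 10 files of 1000 MB each, then read them back
hadoop jar $TESTS_JAR TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar $TESTS_JAR TestDFSIO -read  -nrFiles 10 -fileSize 1000

# Remove the benchmark files from HDFS afterwards
hadoop jar $TESTS_JAR TestDFSIO -clean
```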

3.4. TeraSort test
The TeraSort algorithm was created by the Microsoft database expert Jim Gray. In 2008, Hadoop used the TeraSort algorithm to sort 1TB of data in 209 seconds, ranking first. TeraSort first extracts a summary (a sample) of the data, then distributes the Map output to the Reduce nodes accordingly, and finally completes the sort. The TeraSort program tests the performance of the Hadoop platform by sorting text files. The experiments sorted text files of 100M, 300M, 500M, 1G and 2G respectively; the results are shown in Table 7.

Table 7. TeraSort test results

Size of input file   100M  300M  500M  1G   2G
Execution time (s)   33    68    141   288  731

Figure 4. TeraSort test results.

As Figure 4 shows, the execution time of the TeraSort program increases with the amount of data. While the amount of data processed is within 1G, the execution time grows slowly; when it increases to 2G, the execution time grows very quickly.
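For reference, teragen and terasort are also part of the examples jar; a sketch of a run on roughly 1G of generated data (teragen takes the number of 100-byte rows, so 10 million rows are about 1 GB; the HDFS paths are placeholders) is:

```bash
EXAMPLES_JAR=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar

# Generate 10,000,000 rows of 100 bytes each (about 1 GB) as sort input
hadoop jar $EXAMPLES_JAR teragen 10000000 /terasort/input

# Sort the generated data
hadoop jar $EXAMPLES_JAR terasort /terasort/input /terasort/output
```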

4. Conclusion
This paper has introduced the implementation of a Hadoop cloud computing platform, carried out 4 benchmark tests on the Hadoop platform, and analyzed the resulting data. The tests show that the advantages of the Hadoop platform begin to emerge as the amount of test data increases. The experiments were implemented and run only on virtual machines, so there is still a gap between these figures and reality, because in actual production the network scale and server performance of a Hadoop platform are far more powerful. Nevertheless, this project has served its purpose for performance testing of the Hadoop platform and points out the following research direction: careful analysis of the above test results shows that the performance of this Hadoop platform still needs to be optimized. Further optimization research can address the parameter configuration and the implementation algorithms of the Hadoop software architecture; after optimizing these aspects, the performance of the system will be greatly improved.

Acknowledgments
This work was financially supported by: the Fujian Natural Science Foundation of China (Number 2015J01286): Research on the e-commerce chain based on a collaborative perspective in a cloud computing environment; the JK project of the Fujian Provincial Department of Education (Number JK2014037): Research on e-commerce chain coordination based on cloud computing; the Curriculum Construction Project of Quanzhou Normal University: Demonstration Network Course Construction--ERP (Enterprise Resource Planning); the Students Innovation and Entrepreneurship Training Program Funded Projects of Quanzhou Normal University; and the Fujian Province young and middle-aged teacher education research project (Number JAT170494): Research on university teaching resource sharing based on cloud computing.