Enhanced Hadoop with Search and MapReduce Concurrency Optimization

Volume 114 No. 12 2017, 323-331
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu

Pravin Sanap 1, Bharat Patare 1, Ajay Waghmare 1, Mukesh Rathod 1, Snehal Mulay 1
1 Dept of Information Technology, PVG's College of Engg & Technology, Pune, India

Abstract. Hadoop is used to process Big Data in parallel, but its major disadvantage is that dealing with small files is time consuming. Existing Hadoop treats every file as a separate block regardless of the file's size, so for a huge number of small files it creates one block per file, inflating the metadata held by the Namenode, which is inefficient. In the proposed solution, called Enhanced Hadoop, small files are merged into a single block while being uploaded from the local file system to HDFS, which reduces the metadata size on the Namenode. The Common Job Block Table (CJBT) is also improved: along with the block locations of related jobs, the proposed solution stores the job IDs found in the block locations of the searched job ID. This optimizes search time for previously searched as well as not-yet-searched block locations.

Keywords: Hadoop, HDFS, MapReduce, H2Hadoop, Related Jobs, Enhanced Hadoop, Hadoop problems, Weather Station Application

1 Introduction

Hadoop is an open-source, Java-based programming framework that supports the processing and storage of extremely large datasets in a distributed computing environment. Hadoop has two main components: HDFS (Hadoop Distributed File System) and MapReduce [1][2][3].

1.1 HDFS

HDFS is the file system used by Hadoop for storing huge amounts of data [10]. HDFS is deployed on low-cost commodity hardware. When HDFS takes in data, it breaks the information down into separate pieces called blocks and distributes them to different nodes in a cluster for parallel processing. The file system also creates multiple copies of each piece of data and distributes the copies to individual nodes, placing at least one copy on a different server rack than the others, which makes HDFS fault tolerant. HDFS has two main components, the Namenode and the Datanode. There is a single Namenode [9] per cluster, which manages file system operations, and one Datanode on each node in the cluster, which manages data storage on that node.
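To make the block and replica layout concrete, the following is a minimal sketch that lists the blocks of an HDFS file through the Hadoop Java API. It is an illustration only: the path /allinone/sample is hypothetical, and fs.defaultFS is assumed to point at a running Namenode such as hdfs://localhost:54310.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        // Connects to the Namenode named by fs.defaultFS in the configuration
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path("/allinone/sample");   // hypothetical HDFS path
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per HDFS block; each lists the Datanodes holding a replica
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " replicas on: " + String.join(", ", b.getHosts()));
        }
        fs.close();
    }
}
```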

1.2 MapReduce

MapReduce is a framework with which we can write applications that process huge amounts of data in parallel on large clusters of commodity hardware. MapReduce is a processing technique and a programming model for distributed computing based on Java. A MapReduce job contains two important tasks, namely Map and Reduce [11]. Map takes a set of data and converts it into another set of data in which individual elements are broken down into key/value pairs. The Reduce task takes the output of a Map as its input and combines those key/value pairs into a smaller set of key/value pairs. As the order of the name MapReduce implies, the Reduce task is always performed after the Map task. The Map tasks run simultaneously, one on each block of the data.
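The following is a minimal sketch of such a Map and Reduce pair in the org.apache.hadoop.mapreduce API, counting records per weather station in the spirit of the Weather Station Application used later in the paper. The assumption that the station ID is the first whitespace-delimited field of each record is for illustration only and is not the actual NOAA record layout.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: break each record into a (stationId, 1) key/value pair
public class StationMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // Assumed layout: station ID is the first whitespace-delimited field
        String stationId = record.toString().split("\\s+")[0];
        context.write(new Text(stationId), ONE);
    }
}

// Reduce: combine the pairs for one station into a single (stationId, count) pair
class StationReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text stationId, Iterable<IntWritable> counts,
                          Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(stationId, new IntWritable(sum));
    }
}
```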
2 Problems With Existing Hadoop

The main problem [4] the currently existing Hadoop faces is dealing with small files. It takes more time to process small files than large ones, and small files also increase the burden on the Namenode [12]. Another problem with existing Hadoop is that it does not keep track of previously executed jobs.

2.1 H2Hadoop

In H2Hadoop [5], before tasks are assigned to the Datanodes, there is a pre-processing phase in the Namenode [8]. In this phase a table named the CJBT is maintained; it stores the job ID as well as the block locations of previously executed MapReduce jobs. Each time before assigning block locations to a MapReduce job, the Namenode consults the CJBT to see whether the current job has already been executed. If it has, the Namenode picks the block locations from the CJBT and assigns them to the MapReduce job; otherwise the Namenode assigns all block locations to the MapReduce job. Consider the following CJBT (a sketch of the lookup appears after Section 2.2):

Station ID         | Block Locations
010010-99999-1935  | hdfs://localhost:54310/allinone/1484423492872:402653184+134217728$

Fig. 1. CJBT Table

Here the user searches for the station ID 010010-99999 for the year 1935, and the CJBT stores all the block locations that contain the searched station ID. When the user later searches for any month, day, hour or minute related to station ID 010010-99999 and the year 1935, the MapReduce job is sent only to the block locations present in the Block Locations column of the CJBT instead of to all block locations. This reduces the CPU read-write cycles as well as the time required to search the record.

2.2 Related Jobs

Jobs which search for common results are called related jobs [5]. In the Weather Station Application, suppose we search for the station ID 010010-99999 for the year 1935. Then every search for a month, day, hour or minute belonging to station ID 010010-99999 and the year 1935 is considered a related job.
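The CJBT lookup described in Section 2.1 can be pictured as a map from a job ID to its cached block locations, with a related job recognized by a shared station-ID/year prefix. The following is a minimal in-memory sketch; the class and method names are illustrative, not H2Hadoop's actual code, which keeps the table on the Namenode.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative CJBT: job ID (e.g. "010010-99999-1935") -> cached block locations
public class CjbtLookup {
    private final Map<String, List<String>> table = new HashMap<>();

    // Called after a job finishes: remember where its records were found
    public void record(String jobId, List<String> blockLocations) {
        table.put(jobId, new ArrayList<>(blockLocations));
    }

    // Called before scheduling: return cached locations when this job refines a
    // previously executed one (same station/year prefix), else all block locations
    public List<String> locationsFor(String jobId, List<String> allBlocks) {
        for (Map.Entry<String, List<String>> entry : table.entrySet()) {
            if (jobId.startsWith(entry.getKey())) {
                return entry.getValue();   // related job: reuse its block locations
            }
        }
        return allBlocks;                  // unseen job: scan everything
    }
}
```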
3 Enhanced Hadoop

Our Enhanced Hadoop overcomes the previously mentioned problems of existing Hadoop. In this system the CJBT stores the block locations of an executed MapReduce job as well as the other job IDs that exist in those block locations. Enhanced Hadoop also deals with the problem of small files by merging them together before uploading, which reduces the metadata size on the Namenode [6].

3.1 Workflow of Enhanced Hadoop

Fig. 2. Enhanced Hadoop Workflow

As shown in Fig. 2 [13], the user first sends a request to the Namenode to get the block locations on which to execute the MapReduce job. Before returning the block locations to the user, the Namenode checks whether the data is present in HDFS. If it is not, the Namenode copies the data from the local file system to HDFS.

While copying, the Namenode merges the files together and writes them into a single block. In the next step, the Namenode looks into the Advanced CJBT to see whether the ID of the job being executed exists in the table. If it is present, the Namenode picks the block locations from the Advanced CJBT and sends them to the user. If the job ID does not exist in the table, the Namenode still consults the Available Station IDs column of the Advanced CJBT (shown in Fig. 3) to see whether the current job ID exists in any previously searched block location. If it does, the Namenode returns all available block locations to the user; otherwise it skips those previously searched block locations in which the current job ID was not found. After receiving the block locations from the Namenode, the user launches the MapReduce tasks on the received block locations and stores the result in HDFS.

3.2 Advanced CJBT Table

Compared to H2Hadoop's CJBT, the Advanced CJBT introduces one additional column, Available Station IDs, which stores the station IDs found in the block locations of previously executed MapReduce jobs. The following table shows the structure of the Advanced CJBT:

Station ID      | Searched Station Locations                                          | Available Station IDs
010010999991935 | hdfs://localhost:54310/allinone/1484423492872:402653184+134217728$ | 010017999991999$010015999991999$010010999991931$010010999991933$010010999991955$
010010999991955 | hdfs://localhost:54310/allinone/1484423492872:402653184+134217728$ | Available$
010140999991974 | hdfs://localhost:54310/allinone/1484423492872:0+134217728$          | 010200999992012$010200999992011$010200999992013$010200999992015$

Fig. 3. Advanced CJBT Table

When the user searches for the station ID 010010999991935, Enhanced Hadoop stores, in the Searched Station Locations column, the locations of all the blocks that contain records related to that station ID. Having found those block locations, it stores in the Available Station IDs column all the station IDs present in them. For example, in Fig. 3, Enhanced Hadoop found station ID 010010999991935 in the block location hdfs://localhost:54310/allinone/1484423492872:402653184+134217728$, which also contains the other station IDs 010017999991999, 010015999991999, 010010999991931, 010010999991933 and 010010999991955, stored separated by $.

Next, suppose the user searches for the station ID 010010999991955. As this station ID is not related to the previously searched one, the MapReduce job request would ordinarily be sent to all blocks. Before doing so, Enhanced Hadoop first checks the Available Station IDs column of the 010010999991935 row. This row contains station ID 010010999991955, meaning the searched station ID exists in the block location of the 010010999991935 row, so the MapReduce job is sent to all blocks. Since the block locations of both station IDs turn out to be the same, both rows would also have the same Available Station IDs; instead of writing the available station IDs again, Enhanced Hadoop just writes "Available", indicating that those station IDs are already recorded.

Now the user searches for another station ID, 010140999991974. As this is again not a job related to the previous two station IDs, the MapReduce job request would be sent to all blocks. Before sending the request, Enhanced Hadoop checks the Available Station IDs column of the previously searched rows. The searched station ID does not appear there, which means it is not present in the already searched block locations, so sending the request to those block locations would be useless. Enhanced Hadoop therefore sends the job to all blocks except the block location hdfs://localhost:54310/allinone/1484423492872:402653184+134217728$, as that block does not contain station ID 010140999991974.
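To make this decision logic concrete, the following is a minimal in-memory sketch of the Advanced CJBT lookup just described. The row layout mirrors Fig. 3, but the class and method names are illustrative only; the actual table lives on the Namenode (the experiments in Section 4 have Apache HBase available for such storage), and for simplicity the sketch stores each row's station IDs directly rather than using the "Available" shorthand of Fig. 3.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative Advanced CJBT row, mirroring the columns of Fig. 3
class AdvancedCjbtRow {
    final String stationId;                 // Station ID
    final Set<String> blockLocations;       // Searched Station Locations
    final Set<String> availableStationIds;  // Available Station IDs

    AdvancedCjbtRow(String id, Set<String> blocks, Set<String> available) {
        this.stationId = id;
        this.blockLocations = blocks;
        this.availableStationIds = available;
    }
}

public class AdvancedCjbt {
    private final List<AdvancedCjbtRow> rows = new ArrayList<>();

    public void addRow(AdvancedCjbtRow row) { rows.add(row); }

    // Decide which block locations a job searching for stationId should scan
    public Set<String> locationsFor(String stationId, Set<String> allBlocks) {
        Set<String> result = new HashSet<>(allBlocks);
        for (AdvancedCjbtRow row : rows) {
            if (row.stationId.equals(stationId)) {
                return row.blockLocations;   // already searched: reuse cached locations
            }
            if (!row.availableStationIds.contains(stationId)) {
                // stationId was not observed in these already-searched blocks,
                // so scanning them again would be useless: skip them
                result.removeAll(row.blockLocations);
            }
        }
        return result;
    }
}
```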

3.3 Optimization of MapReduce Concurrency

A main drawback of Hadoop is that it cannot deal with small files efficiently. It takes more time to process small files, of KB or MB size, than big files of GB or TB size. In normal Hadoop, when data is uploaded, each file is treated as a single block. Suppose the block size is 128 MB; then even a 1 KB file is stored as its own block, so if 10 files of 1 KB each are uploaded into HDFS, they are uploaded into 10 different blocks of block size 128 MB each. In the Weather Dataset [7] there are thousands of small files ranging from a few KB to at most 20-30 MB, so there would be thousands of blocks. This causes Hadoop the following problems:

I] If the block size is small, the metadata size becomes high.
II] If the block size were set to less than 64 MB, there would be a huge number of blocks throughout the cluster, forcing the Namenode to manage an enormous amount of metadata.
III] Since a mapper is needed for each block, there would be a lot of mappers, each processing a tiny piece of data, which is not efficient.
IV] HDFS blocks are large compared to disk blocks precisely to minimize the cost of seeks.

To overcome these problems, Enhanced Hadoop uploads small files in the following way. Consider a block size of 128 MB, and suppose the user selects 90 small files totalling 87.19 MB. After running the Upload.jar file, the selected 90 files are uploaded into HDFS as follows:

Permission | Owner  | Group    | Size     | Last Modified     | Replication | Block Size | Name
-rw-r--r-- | hduser | subgroup | 87.19 MB | 2/6/2017, 2:59 AM | 1           | 128 MB     | 1486328420018_ap

Fig. 4. HDFS Browse Directory

As Fig. 4 shows, instead of creating 90 blocks, Enhanced Hadoop merged all the small files into a single file, creating only a single block of size 87.19 MB. Now suppose another 45.44 MB of data needs to be uploaded into HDFS. As about 41 MB of space is still available in the previous block, Enhanced Hadoop merges the new data with the previous file until the 128 MB block is full, and the remaining 4.63 MB is stored in the next block, as shown below:

Permission | Owner  | Group    | Size    | Last Modified        | Replication | Block Size | Name
-rw-r--r-- | hduser | subgroup | 128 MB  | 2/6/2017, 3:44:59 AM | 1           | 128 MB     | 1486332890397
-rw-r--r-- | hduser | subgroup | 4.63 MB | 2/6/2017, 3:44:59 AM | 1           | 128 MB     | 1486332890397_ap

Fig. 5. HDFS Browse Directory
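As an illustration of this merge-on-upload behaviour (the role played by Upload.jar above), the following is a minimal sketch that concatenates local small files into HDFS files capped at one 128 MB block each. It is a simplification under stated assumptions: the paths are hypothetical, and the sketch rolls over to a new file at small-file boundaries rather than splitting a file mid-stream as described above.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Merge-on-upload: pack many local small files into HDFS files of at most
// one block each, instead of one (mostly empty) block per small file.
public class MergeUpload {
    private static final long BLOCK_SIZE = 128L * 1024 * 1024;   // 128 MB

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        File[] smallFiles = new File(args[0]).listFiles();       // local directory

        int part = 0;
        long written = 0;
        FSDataOutputStream out = fs.create(new Path("/allinone/merged-" + part));
        for (File f : smallFiles) {
            if (written + f.length() > BLOCK_SIZE) {             // block full: roll over
                out.close();
                out = fs.create(new Path("/allinone/merged-" + ++part));
                written = 0;
            }
            try (InputStream in = new FileInputStream(f)) {
                IOUtils.copyBytes(in, out, 4096, false);         // append file contents
            }
            written += f.length();
        }
        out.close();
        fs.close();
    }
}
```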

4 Results Of Various Experiments

With one master node and two slave nodes running Ubuntu 16.10, Apache Hadoop 2.5.2, Apache HBase 1.2.3 and the Eclipse IDE, various experiments were carried out on the Weather Station Application [7], generating the following results.

4.1 Execution time of separate small files vs. execution time after merging those small files

Fig. 6. Execution time of separate small files vs. merged files

As Fig. 6 shows, the execution time required by the application is 26.38 seconds, which is very much less than that of the existing Hadoop system.

4.2 Native Hadoop vs. Enhanced Hadoop

As Fig. 7 shows, the first-time execution of Native and Enhanced Hadoop reads the same number of blocks and lines and takes the same time to execute the MapReduce job.

Fig. 7. First-time job execution of Native Hadoop and Enhanced Hadoop

Fig. 8 shows the huge difference between Native Hadoop and Enhanced Hadoop in the case of related job execution.

Fig. 8. Related job execution of Native Hadoop and Enhanced Hadoop

As Fig. 9 shows, even though a new, unrelated job is being executed, the number of blocks read by Enhanced Hadoop is one less than with Native Hadoop. This is because, as mentioned previously, the Advanced CJBT introduces the additional Available Station IDs column, which stores the station IDs found in the block locations of previously searched station IDs.

Fig. 9. Unrelated job execution of Native Hadoop vs. Enhanced Hadoop

Conclusion

The Enhanced Hadoop framework modifies the existing Hadoop framework by sending MapReduce job requests only to those blocks where the required data is present, reducing the CPU read-write cycles as well as the time required to execute those MapReduce jobs. It also deals with the problem of the Namenode's metadata size: small files are combined together to form a block. Thus both search and MapReduce concurrency optimizations are achieved in the new Hadoop framework.

References

[1] White, T., Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
[2] Patel, A.B., Birla, M., and Nair, U., Addressing big data problem using Hadoop and MapReduce. In Engineering (NUiCONE), 2012 Nirma University International Conference on, 2012.
[3] http://www.whatis.com
[4] Jagadish, H., et al., Big data and its technical challenges. Communications of the ACM, 2014, 57(7), p. 86-94.
[5] Alshammari, H., Lee, J., and Bajwa, H., H2Hadoop: Improving Hadoop Performance Using the Metadata of Related Jobs.
[6] Hadoop in Action. Manning (eBook).
[7] http://www.ncdc.noaa.gov/pub/data/noaa/
[8] https://www.edureka.co/blog/anatomy-of-a-mapreduce-job-in-apache-hadoop/
[9] http://ercoppa.github.io/hadoopinternals/hadooparchitectureoverview.html
[10] https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
[11] https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
[12] https://community.hortonworks.com/articles/15104/small-files-in-hadoop.htm
[13] https://creately.com/diagram/example/hwlfeptt/hadoop%20process
