Volume 114 No. 12 2017, 323-331
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu

Enhanced Hadoop with Search and MapReduce Concurrency Optimization

Pravin Sanap 1, Bharat Patare 1, Ajay Waghmare 1, Mukesh Rathod 1, Snehal Mulay 1
1 Dept. of Information Technology, PVG's College of Engg & Technology, Pune, India

Abstract. Hadoop is used to process Big Data in parallel, but its major disadvantage is that dealing with small files is time consuming. Existing Hadoop treats each file as a separate block regardless of the block size, so for a huge number of small files it creates one block per file, inflating the metadata size of the NameNode, which is inefficient. In the proposed solution, called Enhanced Hadoop, small files are merged into a single block while being uploaded from the local file system to HDFS. This reduces the metadata size of the NameNode. There are also improvements to the Common Job Block Table (CJBT): along with the block locations of related jobs, the proposed solution stores the job IDs found in the block locations of the searched job ID. As a result, search time is optimized for previously searched as well as not-yet-searched block locations.

Keywords: Hadoop, HDFS, MapReduce, H2Hadoop, Related Jobs, Enhanced Hadoop, Hadoop problems, Weather Station Application

1 Introduction

Hadoop is an open-source, Java-based programming framework that supports the processing and storage of extremely large datasets in a distributed computing environment. There are two main components of Hadoop: HDFS (Hadoop Distributed File System) and MapReduce [1][2][3].

1.1 HDFS

HDFS is the file system used by Hadoop for storing huge amounts of data [10]. HDFS is deployed on low-cost commodity hardware. When HDFS takes in data, it breaks the information down into separate pieces called blocks and distributes them to different nodes in a cluster for parallel processing. The file system also creates multiple copies of each piece of data and distributes the copies to individual nodes, placing at least one copy on a different server rack than the others, making HDFS fault tolerant. HDFS has two main components: the NameNode and the DataNode. There is a single NameNode [9] per cluster that manages file system operations, and one DataNode on each node in the cluster that manages the data storage on that individual node.
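As an illustration of this block layout, the following minimal sketch asks the NameNode for the block locations of an HDFS file through Hadoop's Java FileSystem API. It is only an assumption of how a client might inspect the layout, not part of the proposed system, and the file path is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumes fs.defaultFS points at the cluster, e.g. hdfs://localhost:54310
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/allinone/weather.txt"); // hypothetical HDFS path
            FileStatus status = fs.getFileStatus(file);

            // Ask the NameNode which DataNodes hold each block of the file
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }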
1.2 MapReduce

MapReduce is a framework with which we can write applications that process huge amounts of data in parallel on large clusters of commodity hardware. MapReduce is a processing technique and a programming model for distributed computing based on Java. A MapReduce job contains two important tasks, namely Map and Reduce [11]. Map takes a set of data and converts it into another set of data, in which individual elements are broken down into key/value pairs. The Reduce task then takes the output of a Map as its input and combines those key/value pairs into a smaller set of key/value pairs. As the name MapReduce implies, the Reduce task is always performed after the Map task. The Map tasks are performed simultaneously on each block of the data.

2 Problems With Existing Hadoop

The main problem [4] that existing Hadoop currently faces is dealing with small files. It takes more time to process small files than large files, and small files also increase the burden on the NameNode [12]. Another problem with existing Hadoop is that it does not keep track of previously executed jobs.

2.1 H2Hadoop

In H2Hadoop [5], before tasks are assigned to the DataNodes, there is a pre-processing phase in the NameNode [8]. In this phase, a table named the CJBT is maintained, which stores the job ID as well as the block locations of previously executed MapReduce jobs. Each time, before assigning block locations to a MapReduce job, the NameNode consults the CJBT to see whether the current job has already been executed. If it has, the NameNode picks the block locations from the CJBT and assigns them to the MapReduce job; otherwise, the NameNode assigns all block locations to the MapReduce job. Consider the following CJBT:

Station ID          Block Locations
010010-99999-1935   hdfs://localhost:54310/allinone/1484423492872:402653184+134217728$

Fig. 1. CJBT Table

Here the user searches for the station ID 010010-99999 for the year 1935, and the CJBT stores all the block locations that contain the searched station ID. Now, when the user searches for any month, day, hour or minute related to the station ID 010010-99999 and the year 1935, instead of the MapReduce job being sent to all block locations, it is sent only to those block locations present in the Block Locations column of the CJBT. This reduces CPU read-write cycles as well as the time required to search the record.

2.2 Related Jobs

Jobs which search for common results are called related jobs [5]. In the Weather Application, suppose we want to search for the station ID 010010-99999 for the year 1935. Then every search for a month, day, hour or minute belonging to the station ID 010010-99999 for the year 1935 is considered a related job.
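The related jobs above are ordinary MapReduce search jobs of the kind described in Section 1.2. To make the Map and Reduce roles concrete for the Weather Station Application, the following minimal sketch searches raw weather records for a given station ID and counts the matching lines. The record layout (each line beginning with its station ID), the class names, and the configuration key are our assumptions for illustration, not the exact jobs used in the experiments below:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: emit (stationId, 1) for every input line that belongs to the searched station
    public class StationSearchMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private String searchedId;

        @Override
        protected void setup(Context context) {
            // The searched station ID is passed in through the job configuration
            searchedId = context.getConfiguration().get("searched.station.id");
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumption: every record begins with its station ID,
            // e.g. "010010999991935 ..."
            if (line.toString().startsWith(searchedId)) {
                context.write(new Text(searchedId), ONE);
            }
        }
    }

    // Reduce: sum the per-block counts produced by the Map tasks
    class StationSearchReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text station, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            context.write(station, new IntWritable(total));
        }
    }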
3 Enhanced Hadoop

Our Enhanced Hadoop overcomes the previously mentioned problems of existing Hadoop. In this system, the CJBT stores the block locations of an executed MapReduce job as well as the other job IDs that exist in those block locations. Enhanced Hadoop also deals with the small-files problem by merging small files together before uploading, which reduces the metadata size in the NameNode [6].

3.1 Workflow of Enhanced Hadoop

Fig. 2. Enhanced Hadoop Workflow

As shown in Fig. 2 [13], the user first sends a request to the NameNode to get the block locations on which to execute the MapReduce job. But before returning the block locations to the user, the NameNode checks for the data in HDFS. If the data is not present in HDFS, the NameNode copies the data from the local file system to HDFS, as sketched below.
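A minimal sketch of this check-and-copy step, using Hadoop's Java FileSystem API; the local and HDFS paths and the class name are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class EnsureDataInHdfs {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            Path hdfsDir = new Path("/allinone");                     // hypothetical HDFS target
            Path localDir = new Path("file:///home/hduser/weather");  // hypothetical local source

            // Copy the data from the local file system only when
            // HDFS does not already hold it
            if (!fs.exists(hdfsDir)) {
                fs.copyFromLocalFile(localDir, hdfsDir);
            }
            fs.close();
        }
    }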
The NameNode merges the files together and writes them into a single block. In the next step, the NameNode looks into the Advanced CJBT to see whether the currently executing job ID exists there. If it is present, the NameNode picks the block locations from the Advanced CJBT and sends them to the user. But if the job ID does not exist in the Advanced CJBT, the NameNode still consults the Available Station IDs column of the Advanced CJBT (as shown in Fig. 3) to see whether the currently executing job ID exists in the previously searched block locations. If it does, the NameNode returns all available block locations to the user; otherwise, it skips those previously searched block locations in which it did not find the currently executing job ID. After receiving the block locations from the NameNode, the user launches the MapReduce tasks on the received block locations and stores the result in HDFS.

3.2 Advanced CJBT Table

Compared to H2Hadoop's CJBT, the Advanced CJBT introduces one additional column, called Available Station IDs, which stores the station IDs found in the block locations of previously executed MapReduce jobs. The following table shows the structure of the Advanced CJBT:

Station ID        Searched Station Locations                                            Available Station IDs
010010999991935   hdfs://localhost:54310/allinone/1484423492872:402653184+134217728$    010017999991999$010015999991999$010010999991931$010010999991933$010010999991955$
010010999991955   hdfs://localhost:54310/allinone/1484423492872:402653184+134217728$    Available$
010140999991974   hdfs://localhost:54310/allinone/1484423492872:0+134217728$            010200999992012$010200999992011$010200999992013$010200999992015$

Fig. 3. Advanced CJBT Table

When the user searches for the station ID 010010999991935, Enhanced Hadoop stores, in the Searched Station Locations column, the locations of all the blocks that contain records related to the station ID 010010999991935. When Enhanced Hadoop finds a block location for the station ID 010010999991935, the Available Station IDs column stores all the station IDs present in that block location. For example, in Fig. 3, Enhanced Hadoop found the station ID 010010999991935 in the block location hdfs://localhost:54310/allinone/1484423492872:402653184+134217728$, which also contains the other station IDs 010017999991999$010015999991999$010010999991931$010010999991933$010010999991955$, separated by $.

Next, suppose the user searches for the station ID 010010999991955. As the searched station ID is not related to the previously searched station ID, the MapReduce job request will be sent to all blocks. But before sending the MapReduce job to all blocks, Enhanced Hadoop first checks the Available Station IDs column of the 010010999991935 row. As this row contains the station ID 010010999991955, the searched station ID is known to exist in the block location of the 010010999991935 station ID, and the MapReduce job is sent to all blocks. Since the block locations of both station IDs are the same, both rows would have the same Available Station IDs; so instead of writing the available station IDs again, Enhanced Hadoop just writes Available, indicating that the station IDs are already recorded.

Now suppose the user searches for a different station ID, 010140999991974. As this is also not a job related to the previous two station IDs, the MapReduce job request would again go to all blocks. But before sending the request, Enhanced Hadoop checks the Available Station IDs column of the previously searched rows. The searched station ID does not appear there, which means it is not present in the already searched block locations, so sending the request to those block locations would be useless. Enhanced Hadoop therefore sends the job to all blocks except the block location hdfs://localhost:54310/allinone/1484423492872:402653184+134217728$, as it does not contain the 010140999991974 station ID.
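The block-selection rules just described can be summarized by the following minimal in-memory sketch. It is only an illustration under our assumptions (plain Java collections, invented class names), not the table implementation used in the experiments of Section 4:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // One row of the Advanced CJBT: the searched block locations plus the
    // other station IDs that were seen inside those blocks.
    class CjbtRow {
        Set<String> searchedLocations = new HashSet<>();
        Set<String> availableStationIds = new HashSet<>();
    }

    public class AdvancedCjbt {
        private final Map<String, CjbtRow> table = new HashMap<>();

        // Decide which block locations a new job for stationId should be sent to.
        public List<String> selectBlocks(String stationId, List<String> allBlocks) {
            CjbtRow row = table.get(stationId);
            if (row != null) {
                // Case 1: the job was executed before - reuse its block locations.
                return new ArrayList<>(row.searchedLocations);
            }
            List<String> targets = new ArrayList<>(allBlocks);
            for (CjbtRow previous : table.values()) {
                if (!previous.availableStationIds.contains(stationId)) {
                    // Case 2: a previously searched block is known NOT to contain
                    // this station ID, so skip that block location.
                    targets.removeAll(previous.searchedLocations);
                }
            }
            return targets;
        }
    }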
3.3 Optimization of MapReduce Concurrency

The main drawback of Hadoop is that it cannot deal with small files efficiently. It takes more time to process small files of size KBs or MBs than large files of size GBs or TBs. In normal Hadoop, when data is uploaded, each single file is treated as a single block. Suppose the block size is 128 MB; then even a file of size 1 KB is considered a single block. So if 10 files of size 1 KB each are uploaded into HDFS, these 10 files occupy 10 different blocks of size 128 MB each. In the Weather Dataset [7], there are thousands of small files, ranging from KBs to at most 20 to 30 MBs, so there would be thousands of blocks. This causes Hadoop to face the following problems:

I] If the block size is small, the metadata size will be high.
II] If the block size were set to less than 64 MB, there would be a huge number of blocks throughout the cluster, which would cause the NameNode to manage an enormous amount of metadata.
III] Since we need a mapper for each block, there would be a lot of mappers, each processing a small piece of data, which is not efficient.
IV] HDFS blocks are large compared to disk blocks in order to minimize the cost of seeks; filling each block with a single small file defeats this design.

To overcome these problems, Enhanced Hadoop uploads small files in the following way. Consider a block size of 128 MB, and suppose the user selects 90 small files with a total size of 87.19 MB. After running the Upload.jar file, the selected 90 files are uploaded into HDFS as follows:

Permission   Owner    Group     Size      Last Modified      Replication  Block Size  Name
-rw-r--r--   hduser   subgroup  87.19 MB  2/6/2017, 2:59 AM  1            128 MB      1486328420018_ap

Fig. 4. HDFS Browse Directory

As shown in Fig. 4, instead of creating 90 blocks, Enhanced Hadoop merged all the small files into a single file, so it created only a single block of size 87.19 MB. Now consider uploading a further 45.44 MB of data into HDFS. As there is still about 41 MB of space available in the previous block, Enhanced Hadoop merges this 45.44 MB of data with the previous file until the 128 MB block is full, and the remaining 4.63 MB of data is stored in the next block, as shown in Fig. 5.
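A minimal sketch of this merge-on-upload step, assuming a fixed 128 MB block budget. The class name and source directory are hypothetical, and the roll-over policy is simplified compared to our Upload.jar (whole files are rolled to the next block rather than split, unlike the behaviour shown in Fig. 5):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class MergeSmallFilesUpload {
        private static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Hypothetical local source directory, assumed to exist and be non-empty
            File[] smallFiles = new File("/home/hduser/weather").listFiles();

            int blockNo = 0;
            long written = 0;
            // Append small files into one HDFS file until the 128 MB budget is
            // used up, then roll over to the next file (i.e. the next block).
            FSDataOutputStream out = fs.create(new Path("/allinone/merged-" + blockNo));
            for (File f : smallFiles) {
                if (written > 0 && written + f.length() > BLOCK_SIZE) {
                    out.close();
                    blockNo++;
                    written = 0;
                    out = fs.create(new Path("/allinone/merged-" + blockNo));
                }
                try (InputStream in = new FileInputStream(f)) {
                    IOUtils.copyBytes(in, out, 4096, false); // keep 'out' open
                }
                written += f.length();
            }
            out.close();
            fs.close();
        }
    }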
Permission   Owner    Group     Size     Last Modified         Replication  Block Size  Name
-rw-r--r--   hduser   subgroup  128 MB   2/6/2017, 3:44:59 AM  1            128 MB      1486332890397
-rw-r--r--   hduser   subgroup  4.63 MB  2/6/2017, 3:44:59 AM  1            128 MB      1486332890397_ap

Fig. 5. HDFS Browse Directory

4 Results Of Various Experiments

Using one master node and two slave nodes with the Ubuntu 16.10 OS, Apache Hadoop 2.5.2, Apache HBase 1.2.3 and the Eclipse IDE, various experiments were carried out on the Weather Station Application [7], generating the following results:

4.1 Execution time required by separate small files Vs execution time required by merging those small files

Fig. 6. Execution time of separate small files Vs merged files

As Fig. 6 shows, the execution time required by the application on the merged files is 26.38 seconds, which is much less than that of the existing Hadoop system.

4.2 Native Hadoop Vs Enhanced Hadoop

As Fig. 7 shows, the first-time execution of Native and Enhanced Hadoop reads the same number of blocks and lines, and takes the same time to execute the MapReduce job.
Fig. 7. First-time job execution of Native Hadoop and Enhanced Hadoop

Fig. 8 shows the huge difference between Native Hadoop and Enhanced Hadoop in the case of related job execution.

Fig. 8. Related job execution of Native Hadoop and Enhanced Hadoop

As Fig. 9 shows, even though a new (unrelated) job is being executed, the number of blocks read by Enhanced Hadoop is one less than that of Native Hadoop. This is because, as mentioned previously, the Advanced CJBT introduces the additional Available Station IDs column, which stores the station IDs found in the block locations of previously searched station IDs.
Fig. 9. Unrelated job execution of Native Hadoop Vs Enhanced Hadoop

Conclusion

The Enhanced Hadoop framework modifies the existing Hadoop framework by sending MapReduce job requests only to those blocks where the required data is present, reducing CPU read-write cycles as well as the time required to execute those MapReduce jobs. It also deals with the problem of the metadata size of the NameNode: small files are combined together to form a block. Thus both search and MapReduce concurrency optimizations are achieved in the new Hadoop framework.

References

[1] White, T., Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
[2] Patel, A.B., Birla, M., and Nair, U., "Addressing big data problem using Hadoop and MapReduce," in Engineering (NUiCONE), 2012 Nirma University International Conference on, 2012.
[3] http://www.whatis.com
[4] Jagadish, H., et al., "Big data and its technical challenges," Communications of the ACM, 57(7), 2014, pp. 86-94.
[5] Alshammari, H., Lee, J., and Bajwa, H., "H2Hadoop: Improving Hadoop Performance Using the Metadata of Related Jobs."
[6] Hadoop in Action. Manning Publications (eBook).
[7] http://www.ncdc.noaa.gov/pub/data/noaa/
[8] https://www.edureka.co/blog/anatomy-of-a-mapreduce-job-in-apache-hadoop/
[9] http://ercoppa.github.io/hadoopinternals/hadooparchitectureoverview.html
[10] https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
[11] https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
[12] https://community.hortonworks.com/articles/15104/small-files-in-hadoop.htm
[13] https://creately.com/diagram/example/hwlfeptt/hadoop%20process