International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), Volume 3, Issue 3, ISSN: 2456-3307

Hadoop Periodic Jobs Using Data Blocks to Achieve Efficiency

Sujit Roy, Subrata Kumar Das*, Indrani Mandal
Department of Computer Science and Engineering, Jatiya Kabi Kazi Nazrul Islam University, Trishal, Mymensingh, Bangladesh

ABSTRACT

Hadoop is a powerful, fault-tolerant platform for managing, processing, and analyzing very large datasets. It is widely used for big data because it is effective, scalable, and well supported by large vendor and user communities. This paper proposes a new approach to data processing in Hadoop that improves efficiency through synchronous data transmission, sending blocks of data from source to destination. We show how to divide the data blocks to achieve optimal efficiency by adjusting the split size or choosing an appropriate block size. Because the Hadoop hardware configuration is matched to the requirements of each periodic task, the system can distribute the data blocks in a way that increases data efficiency as well as throughput. Finally, experiments demonstrated the effectiveness and feasibility of these methods, achieving high data efficiency (around 22% more than the existing system) at low installation cost.

Keywords: Hadoop, MapReduce, Data Efficiency, Data Blocks, HDFS

I. INTRODUCTION

Hadoop [7] is the popular open-source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets. It can connect and coordinate thousands of nodes inside a cluster. The Hadoop framework has two components: the Hadoop Distributed File System (HDFS) [6] and MapReduce [8]. HDFS is designed for data storage and MapReduce for data processing. The HDFS layer provides a virtual file system to client applications. An important feature of HDFS is data replication: each file is divided into blocks that are replicated among nodes. All data are first stored in HDFS.

From the standpoint of data efficiency, this paper proposes a method to measure the effectiveness of periodic tasks that run on the Hadoop platform. It also studies how to divide the data blocks so that each server in the cluster holds data from different blocks, and it analyzes the corresponding measurement methods. The first step in studying the energy efficiency of the cloud is to establish evaluation criteria. The main task of this paper is to manage how data files are divided and stored across the cluster using HDFS: data is divided into blocks, and each server in the cluster contains data from different blocks.

Hadoop is a distributed framework that makes it easier to process large data sets residing in clusters of computers. Because it is a framework, Hadoop is not a single technology or product, and it can work with any distributed file system. However, HDFS is the primary means for doing so and is the heart of Hadoop technology. HDFS manages how data files are divided and stored across the cluster: data is divided into blocks, and each server in the cluster holds data from several different blocks.
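As a concrete illustration of the block division described above, the following is a minimal sketch, not Hadoop's own code, of how a file is carved into splits when the split size equals the block size. The helper name plan_splits is hypothetical, and the 128 MB default is an assumption matching the Hadoop 2.x block size (earlier versions defaulted to 64 MB).

import math

def plan_splits(file_size_mb: int, block_size_mb: int = 128):
    """Carve a file into (offset, length) input splits.

    Mirrors the default behaviour described in the text: the split
    size equals the HDFS block size, and the framework later creates
    one map task per split.
    """
    n_splits = math.ceil(file_size_mb / block_size_mb)
    return [(i * block_size_mb,
             min(block_size_mb, file_size_mb - i * block_size_mb))
            for i in range(n_splits)]

# A 300 MB file yields three splits, hence three map tasks:
print(plan_splits(300))  # [(0, 128), (128, 128), (256, 44)]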

MapReduce

With the MapReduce programming model, programmers only need to specify two functions: Map and Reduce. The Map function takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function. The Reduce function accepts an intermediate key I and a set of values for that key, and it merges those values together to form a possibly smaller set of values.

Figure 1. HDFS and MapReduce flow: files in HDFS are split, map tasks produce intermediate results, and the reduce step writes the output files.

In Figure 1, the map step involves two kinds of nodes: one master node and several worker nodes. The master node splits the data and assigns the key/value pairs. It picks idle workers and assigns each one the job of producing intermediate key/value pairs and sending them back to the master node. The master node keeps detailed information about the location of the data. The Reduce function then receives the intermediate key/value pairs and reduces them to a smaller result.

The remainder of the paper is structured as follows. Section 2 presents the related work in this area and shows the contribution of this work. Section 3 describes the proposed MapReduce architecture. Section 4 depicts the methods for measuring the efficiency of the systems. Section 5 presents our experimental results and discussion, and Section 6 concludes the paper.

II. RELATED WORK

Now, we present some related research work. One of the biggest challenges in distributed systems like Hadoop is to guarantee data stability even if some nodes fail; thus, it is important to maintain redundancy for data stored in a Hadoop system. A more interesting piece of research is by Leverich et al. [9], who developed a mechanism named covering subset, which keeps at least one replica of every data block on a small subset of nodes so that the remaining nodes can be powered down; however, this technique is inefficient. The cluster management system this paper focuses on is similar in essence to other systems such as Condor [10]. The MapReduce implementation relies on a cluster management system that is responsible for distributing and running user tasks on a large collection of shared machines.

Since the concept of cloud computing was promoted in 2006, it has developed rapidly, and many companies have started their own cloud computing platforms, such as Azure from Microsoft, the Blue Cloud computing platform from IBM, Google Compute Engine from Google, and Dynamo from Amazon [1]. Although cloud computing technology continues to develop, the Hadoop platform remains the object that the majority of scholars study. The Hadoop computing platform is composed of the MapReduce computing framework and the HDFS file storage system; current research on Hadoop [2, 3] focuses on performance optimization, but its processing efficiency is not high for small files [4, 5].

III. PROPOSED MAPREDUCE ARCHITECTURE

MapReduce allows for distributed processing of the map and reduce operations. In the mapping operation, all maps can be performed in parallel. Similarly, a set of reducers can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduce function is associative. In this model, a user specifies the computation by two functions, Map and Reduce. In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper; in the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. Both phases operate on data structured in <key, value> pairs. The Map function is applied in parallel: it takes one pair of data of one type from the input dataset and produces a list of intermediate pairs that are passed to Reduce:

map (k1, v1) → list(k2, v2)
reduce (k2, list(v2)) → list(v3)
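These signatures can be made concrete with a minimal, single-process word-count sketch. It only imitates the map/group/reduce semantics described above; it is not the Hadoop implementation, and every name in it (map_fn, reduce_fn, run_mapreduce) is hypothetical.

from collections import defaultdict

def map_fn(_, line):
    # Map: take an input pair (offset, line) and emit
    # intermediate (word, 1) pairs.
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    # Reduce: accept an intermediate key and all values grouped
    # under it, and merge them into a smaller set of values.
    yield key, sum(values)

def run_mapreduce(records):
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)  # shuffle: group by intermediate key
    return [out for k2 in sorted(groups)
                for out in reduce_fn(k2, groups[k2])]

print(run_mapreduce(enumerate(["big data on hadoop", "hadoop map reduce"])))
# [('big', 1), ('data', 1), ('hadoop', 2), ('map', 1), ('on', 1), ('reduce', 1)]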

Figure 2. Proposed MapReduce architecture.

As shown in Figure 2, all the input data are associated with key values. The data are grouped into intermediate results by key value, the Reduce function may be applied to several key values, and finally the reduced output file is produced. The Map function is applied in parallel to every pair (keyed by k1) in the input dataset, producing a list of pairs (keyed by k2) for each call. The MapReduce framework then collects all pairs with the same key (k2) from all lists and groups them together, creating one group for each key. Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values: Reduce takes a list of arbitrary values and returns a single value that combines all the values returned by Map.

The MapReduce programming model divides a program into tasks, of which there are two types: map() and reduce(). Both functions are defined over data structured in <key, value> pairs.

Block size is another important factor in the implementation of Hadoop, and it also affects data efficiency for MapReduce. Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits (usually equal to the block size) and creates one map task for each split.

IV. METHODS

A. Proposed Data Transmission Methodology

In the proposed method, the data blocks are divided so as to obtain maximum data transmission efficiency. First, the target data are stored in a primary storage device. For transmission, the data container format shown in Figure 3 carries a data field of 80-132 characters. A trailer and a header are attached to the data field; both contain an 8-bit flag and a control field. The data are divided into blocks (packets or frames) and transmitted one block at a time, with a block interval between consecutive blocks. Every data block carries 2 bytes of header information at the front and 2 bytes of trailer information at the end.

Figure 3. The format of a data transmission block.
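The framing just described can be sketched as follows. This is only an illustration of the stated layout, a data field wrapped by a 2-byte header and a 2-byte trailer; the 128-character field size is one choice inside the paper's 80-132 character range, and the flag and control values are assumptions, since the paper does not specify them.

def frame_blocks(payload: bytes, field_size: int = 128):
    """Split a payload into transmission blocks as in Figure 3.

    Each block wraps up to field_size bytes of data with a 2-byte
    header and a 2-byte trailer (an 8-bit flag plus a control field;
    the values below are illustrative assumptions).
    """
    FLAG, CONTROL = 0x7E, 0x00
    header = trailer = bytes([FLAG, CONTROL])
    return [header + payload[i:i + field_size] + trailer
            for i in range(0, len(payload), field_size)]

blocks = frame_blocks(b"x" * 1000)
overhead = 4 * len(blocks)               # 32 bits of framing per block
total = sum(len(b) for b in blocks)
print(len(blocks), overhead, 100.0 * (total - overhead) / total)
# 8 blocks, 32 bytes of overhead, ~96.9% transmission efficiency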

B. Algorithm of the Proposed Approach

Step 1. The input reader reads data from stable storage (typically a distributed file system) and generates key/value pairs.
Step 2. Split the input file (split 0 ... split N).
Step 3. Apply Map() to each split (split 0 ... split N).
Step 4. Shuffle the split files according to their file category.
Step 5. Apply Reduce() to optimize the result. Reduce can iterate through the values associated with each key and produce zero or more outputs.
Step 6. Finally, analyze the efficiency of the files using the formula

Data Transmission Efficiency = (Real Data / Total Data) × 100%,

where Total Data = Real Data + Overhead Data.

C. Mathematical Analysis

Now we consider the throughput of the data files. Throughput is the ratio of file size to time:

Throughput(N) = Σ_{i=0}^{N} FileSize_i / Σ_{i=0}^{N} Time_i  (MB/sec),

where Time_i (s) = Size_i (MB) / Transfer Speed (kbps). The average I/O rate is

Average IO Rate(N) = (Σ_{i=0}^{N} Rate_i) / N,  with Rate_i = FileSize_i / Time_i.

The efficiency is analyzed using the formula

Data Efficiency (µ) = (Real Data / Total Data) × 100%,

where Overhead Data = 32 bits × [File Size / Block Size], since each transmitted block carries a 2-byte header and a 2-byte trailer (32 bits of overhead), and Data Transfer Speed (kbps) = Data Size / Transfer Time.
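To make these formulas concrete, the sketch below evaluates them on illustrative inputs. The file sizes are taken from Table 2, but the transfer times are invented for the example (the paper does not report per-file times), so the printed numbers demonstrate the formulas rather than reproduce the paper's measurements.

def transmission_efficiency(real_mb: float, overhead_mb: float) -> float:
    # Data Efficiency (µ) = Real Data / (Real Data + Overhead Data) × 100%
    return 100.0 * real_mb / (real_mb + overhead_mb)

def throughput(sizes_mb, times_s) -> float:
    # Throughput(N) = Σ FileSize_i / Σ Time_i, in MB/sec
    return sum(sizes_mb) / sum(times_s)

def average_io_rate(sizes_mb, times_s) -> float:
    # Average IO Rate(N) = (Σ FileSize_i / Time_i) / N
    return sum(s / t for s, t in zip(sizes_mb, times_s)) / len(sizes_mb)

sizes = [512.0, 1022.0, 1535.0]   # file sizes (MB), from Table 2
times = [40.0, 85.0, 120.0]       # transfer times (s), invented example values
print(throughput(sizes, times))             # ~12.53 MB/sec
print(average_io_rate(sizes, times))        # ~12.54 MB/sec
print(transmission_efficiency(100.0, 6.2))  # ~94.16 %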

V. RESULTS AND DISCUSSION

In this research, various sizes of data were processed with both the existing and the proposed methods to compare the two systems. The proposed system showed better efficiency than the existing methods. Data sizes below 100 MB were processed in the existing system and in the new system, as tabulated in Table 1.

Table 1. Stored files from 0 MB to 100 MB and the corresponding efficiency.

Storage Space (MB)   Data Transmission Efficiency (%)
                     Existing    Proposed
  5                  62.5        13.5
 10                  62.5        23.8
 15                  71.0        31.9
 20                  68.9        38.5
 25                  67.5        43.9
 30                  71.4        48.4
 35                  70.0        52.2
 40                  72.7        55.5
 45                  71.4        58.4
 50                  70.4        60.9
 55                  72.3        63.2
 60                  71.4        65.2
 65                  70.6        67.0
 70                  72.1        68.6
 75                  71.4        70.1
 80                  72.7        71.4
 85                  72.0        72.6
 90                  71.4        73.7
 95                  72.5        74.8
100                  71.9        75.7

The data transmission efficiencies listed in this table show that for small data sizes the existing system is more efficient than the proposed system, but the proposed method overtakes the previous one once the amount of data exceeds about 80 MB, as shown in Figure 4.

Figure 4. The efficiency trend between 0 MB and 100 MB.

The proposed system showed better efficiency when the data size exceeded 100 MB, as tabulated in Table 2. Here the efficiency of the two systems differs substantially: the highest efficiency of the existing system was around 73%, which is about 22% less than the proposed system's 95%. It is noticeable that the efficiency of both methods became saturated after a certain data size, as depicted in Figure 5.

Table 2. Stored files over 100 MB and the corresponding efficiency.

Memory Storage Space (MB)   Data Efficiency (%)
                            Existing    Proposed
   512                      72.72       94.11
  1022                      72.68       94.10
  1535                      72.71       94.11
  2372                      72.69       94.88
  2550                      72.71       95.22
 50555                      72.725      95.237
 61000                      72.727      95.205
 75013                      72.726      95.207
 99236                      72.726      95.210
152345                      72.7263     95.219
250123                      72.7268     95.235

Figure 5. Comparison of the efficiencies found in both systems.

VI. CONCLUSION AND FUTURE WORK

This research focuses on measuring the data efficiency of periodic tasks that run on Hadoop platforms. It also studied how to divide the data blocks efficiently in order to distribute and manage the data. Data block distribution according to the proposed architecture in Hadoop achieved around 22% more efficiency than the other existing systems. Next, we plan to work on bandwidth in Hadoop clustering: since Hadoop holds files of different categories and extensions, it should be possible to increase the data transmission rate by grouping files before execution.

VII. REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI 2004: Proceedings of the 6th Symposium on Operating Systems Design & Implementation, San Francisco, CA, pp. 137-149, 2004.
[2] J. Pan, Y. L. Biannic, and F. Magoulès, "Parallelizing Multiple Group-by Query in Share-Nothing Environment: a MapReduce Study Case," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, Chicago, Illinois, pp. 856-863, 2010.

[3] S. Chen and S. W. Schlosser, "Map-Reduce Meets Wider Varieties of Applications," Technical Report IRP-TR-08-05, Intel Research Pittsburgh, 2008.
[4] S. Leo and G. Zanetti, "Pydoop: a Python MapReduce and HDFS API for Hadoop," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, Chicago, Illinois, pp. 819-825, 2010.
[5] D. Huang, X. Shi, S. Ibrahim, et al., "MR-Scope: a Real-Time Tracing Tool for MapReduce," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, Chicago, Illinois, pp. 849-855, 2010.
[6] P. Kumar and V. S. Rathore, "Efficient Capabilities of Processing of Big Data using Hadoop Map Reduce," International Journal of Advanced Research in Computer and Communication Engineering, vol. 3, issue 6, pp. 7123-7126, 2014.
[7] J. Liu, F. Liu, and N. Ansari, "Monitoring and Analyzing Big Traffic Data of a Large-Scale Cellular Network with Hadoop," IEEE Network, vol. 28, issue 4, pp. 32-39, 2014.
[8] C. Chen, Z. Liu, W. Lin, S. Li, and K. Wang, "Distributed Modeling in a MapReduce Framework for Data-Driven Traffic Flow Forecasting," IEEE Transactions on Intelligent Transportation Systems, vol. 14, issue 1, pp. 22-33, 2013.
[9] J. Leverich and C. Kozyrakis, "On the Energy (In)efficiency of Hadoop Clusters," ACM SIGOPS Operating Systems Review, vol. 44, issue 1, pp. 61-65, 2010.
[10] D. Thain, T. Tannenbaum, and M. Livny, "Distributed Computing in Practice: The Condor Experience," Concurrency and Computation: Practice and Experience, vol. 17, issue 2-4, pp. 323-356, 2005.