Global Journal of Engineering Science and Research Management


A FUNDAMENTAL CONCEPT OF MAPREDUCE WITH MASSIVE FILES DATASET IN BIG DATA USING HADOOP PSEUDO-DISTRIBUTION MODE

K. Srikanth*, P. Venkateswarlu, Ashok Suragala
* Department of Information Technology, JNTUK-UCEV Vizianagaram, INDIA
Department of Information Technology, JNTUK-UCEV Vizianagaram, INDIA
Department of CSE, JNTUK-UCEV Vizianagaram, INDIA

DOI: 10.5281/zenodo.801301

KEYWORDS: HDFS, Hadoop, MapReduce, Name Node, Data Node.

ABSTRACT
The Hadoop Distributed File System (HDFS) and the MapReduce programming model are used for storing and retrieving big data. Big data is any collection of data that exceeds the capabilities of conventional data management methods. Files of terabyte size can easily be stored on HDFS and analyzed with MapReduce. This paper introduces Hadoop HDFS and MapReduce for storing a large number of files and retrieving information from them. We present our experimental work on Hadoop, in which a number of files are supplied as input to the system and the performance of the Hadoop system is analyzed. We study the number of bytes written and read by the file system and by MapReduce, and we analyze the behavior of the map and reduce methods as the number of files, and the bytes written and read by these tasks, increase.

INTRODUCTION
A distributed storage system can be used to store this huge amount of data, and many distributed storage systems are currently available. Data can be in one of three forms: structured, semi-structured, and unstructured. The amount of structured data is very small compared to the semi-structured and unstructured data. The International Data Corporation (IDC) has published statistics on the amounts of structured and unstructured data in the present scenario and the predicted growth of structured and semi-structured data. Big data was first characterized by the "3 Vs": Volume, Velocity, and Variety. A distributed system should provide access transparency, location transparency, concurrency transparency, failure transparency, heterogeneity, scalability, replication transparency, and migration transparency. It should support fine-grained distribution of data and tolerance for network partitioning, and it should provide a distributed file system service for handling files. The main aim of a distributed file system (DFS) is to allow users of physically distributed computers to share data and storage resources through a common file system. A DFS consists of multiple software components that reside on multiple computers but run as a single system. The power of distributed computing can clearly be seen in some of the most ubiquitous modern applications.

Figure 1. HDFS Architecture

RELATED WORK
Big data can be handled using Hadoop and MapReduce. Multi-node clusters can be set up with Hadoop and MapReduce, and files of different sizes can be stored in such a cluster [5]. A shared-disk analysis of Hadoop has been carried out [6]. Big data can also be provided as a service in cloud computing (Data-as-a-Service) [7], which makes big data analysis important in a cloud computing environment. Traditional database management systems exist for data management, but Hadoop has been shown to provide the best service in terms of cost, scalability, and handling of unstructured data [8].

HADOOP HDFS AND MAPREDUCE
Hadoop comes with its default distributed file system, the Hadoop Distributed File System [9]. It stores files in blocks of 64 MB and can store files of varying size, from 100 MB up to gigabytes and terabytes. The Hadoop architecture contains the Name Node, Data Nodes, Secondary Name Node, Task Trackers, and the Job Tracker. The Name Node maintains the metadata about the blocks stored in HDFS; files are stored as blocks in a distributed manner. The Secondary Name Node maintains the validity of the Name Node and updates the Name Node information from time to time. The Data Nodes actually store the data. The Job Tracker receives a job from the user and splits it into parts, then assigns these split jobs to Task Trackers. Each Task Tracker has a fixed number of slots for running tasks, and the Job Tracker selects Task Trackers that have free slots. It is preferable to choose a Task Tracker on the same rack where the data is stored; this is known as rack awareness, and it saves inter-rack bandwidth. (A small client-side sketch of how a program interacts with the Name Node and Data Nodes is given at the end of this section.)

Figure 2 shows the arrangement of the different Hadoop components on a single node. In this arrangement all the components (Name Node, Secondary Name Node, Data Node, and Task Tracker) are on the same system. The user submits a job in the form of a MapReduce task. The Data Node and the Task Tracker are on the same system so that the best read and write speed can be achieved.

Figure 2. Single Node Architecture of Hadoop

When a MapReduce job runs over HDFS, the map tasks first process all the requested file blocks in HDFS, and the reduce tasks then combine the map outputs into the required result.
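To make the division of labor between the Name Node and the Data Nodes concrete, the following minimal client sketch (not from the paper) uses the Hadoop FileSystem API to write a small file into HDFS and read it back: the client obtains block metadata from the Name Node, while the bytes themselves are streamed to and from the Data Nodes. The address hdfs://localhost:9000 and the file path are illustrative assumptions for a pseudo-distributed installation.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed pseudo-distributed setup; in practice this value comes from core-site.xml.
        conf.set("fs.default.name", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits large files into 64 MB blocks
        // and replicates them across Data Nodes.
        Path file = new Path("/user/hadoop/input/sample.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hii This is Srikanth\nWelcome to Hadoop Lab\n"
                    .getBytes(StandardCharsets.UTF_8));
        }

        // Read it back; the Name Node supplies the block locations,
        // while the data itself is read from the Data Nodes.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}

On a single-node (pseudo-distributed) setup such as the one in Figure 2, the Name Node and the Data Node run on the same machine, so these reads and writes are served locally.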

Figure 3. MapReduce data flow with a single reduce task

WORD COUNT
WordCount is a simple MapReduce application. It counts the number of times each word appears in a text file of an input set. It imports the Path, conf.*, io.*, mapred.*, and util.* packages of Hadoop. The Mapper is implemented by a map method that processes one line at a time. The input can be strings or a text file. The map method splits the input into tokens, using whitespace as the separator via the StringTokenizer class, and its output is of the form <<word>, 1>. Different maps generate these key-value pairs. For example, for an input file containing the text "Hii This is Srikanth Welcome to Hadoop Lab", the maps produce the output shown in Table 1.

Table 1. Output of the Mapper
First map output    Second map output    Third map output
(<Hii>, 1)          (<This>, 1)          (<Welcome>, 1)
(<is>, 1)           (<Srikanth>, 1)      (<To>, 1)
(<Hadoop>, 1)       (<Lab>, 1)

The Reducer for these maps is implemented by the reduce method. It takes input from all the maps and, as in merge sort, combines the output of all the maps to produce output of the form (<word>, n), where n is the number of times that word occurs in the input (a generic sketch of such a job is given at the end of this section). The output of the reducer is:

(<Hii>, 1) (<This>, 1) (<is>, 1) (<Srikanth>, 1) (<Welcome>, 1) (<To>, 1) (<Hadoop>, 1) (<Lab>, 1)

The running components of Hadoop can be checked using the jps command; the output of the command is shown in Figure 4. The Name Node can also be accessed through its web interface.

Figure 4. JPS Output
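The map and reduce methods described above correspond to the classic WordCount job. The following is a minimal sketch using the older org.apache.hadoop.mapred API referred to by the imports above; it is a generic illustration rather than the authors' exact code, and the class names and command-line argument convention are assumptions.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    // Map: one line of text in, one (<word>, 1) pair out per token.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // StringTokenizer splits the line on whitespace, as described in the text.
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    // Reduce: all counts for one word arrive together; sum them to get (<word>, n).
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        // HDFS input and output locations are passed on the command line.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

The driver expects the HDFS input and output directories as command-line arguments when the packaged jar is submitted to the cluster; each map call emits (<word>, 1) pairs exactly as in Table 1, and each reduce call sums the 1s for a word to produce (<word>, n) as in the reducer output above.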

Figure 5. Output for wordcount

CONCLUSION AND FUTURE WORK
We have analyzed the performance of the MapReduce tasks as the number of input files increases, using the word count application of MapReduce for this analysis. The output shows that the bytes written do not increase in the same proportion as the number of files. As future work we will establish a cluster of nodes and analyze the behavior of the MapReduce tasks over a number of files. This approach can be used for analyzing files generated as logs, and also for analyzing sensor output, that is, the many files generated by different sensing devices.

REFERENCES
1. http://home.web.cern.ch/about/updates/2013/02/cern-data-centre-passes-100-petabytes
2. http://gigaom.com/2012/08/22/facebook-is-collecting-your-data-500-terabytes-a-day/
3. Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Google, Inc.
4. http://raconteur.net/technology/big-ataor-big-statistics
5. Aditya B. Patel, Manashvi Birla, Ushma Nair, "Addressing Big Data Problem Using Hadoop and Map Reduce," NUiCONE-2012, 06-08 December 2012.
6. Anirban Mukherjee, Joydip Datta, Raghavendra Jorapur, Ravi Singhvi, Saurav Haloi, Wasim Akram, "Shared Disk Big Data Analytics with Apache Hadoop," 978-1-4673-2371-0/12, 2012 IEEE; Bernice Purcell, "The Emergence of Big Data Technology and Analytics," Journal of Technology Research, 2013.
7. "Defining Big Data Architecture Framework: Outcome of the Brainstorming Session at the University of Amsterdam," 17 July 2013. Presented at NBD-WG, 24 July 2013 [Online].
8. Ahmed Eldawy, Mohamed F. Mokbel, "A Demonstration of SpatialHadoop: An Efficient MapReduce Framework for Spatial Data," Proceedings of the VLDB Endowment, Vol. 6, No. 12, 2013.
9. Thomas A. de Ruiter, "A Workload Model for MapReduce," Master's Thesis in Computer Science, Parallel and Distributed Systems Group, Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, 2 June 2012.
10. http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html