High Performance Computing on MapReduce Programming Framework


International Journal of Private Cloud Computing Environment and Management Vol. 2, No. 1, (2015), pp. 27-32
http://dx.doi.org/10.21742/ijpccem.2015.2.1.04

Shaik Mohammed Gouse, Department of Computer Science and Engineering, KL University, gouse@kluniversity.in

Article history: Received (December 2, 2014), Review Result (February 8, 2015), Accepted (March 6, 2015)

Abstract

Big Data is a quantity of data too large to be managed by standard data management systems. Hadoop is a technological answer to Big Data: the Hadoop Distributed File System (HDFS) and the MapReduce programming model are employed for the storage and retrieval of massive data. Files of terabyte size can easily be stored on HDFS and analyzed with MapReduce. This paper provides an introduction to Hadoop HDFS and MapReduce for storing a huge number of files and retrieving information from these files. Our experimental work was done on Hadoop by applying a varying number of files as input to the system and then analyzing the performance of the Hadoop system. We have studied the number of bytes written and read by the system and by MapReduce, and we have analyzed the behavior of the map and reduce tasks as the number of files and the quantity of bytes written and read by these tasks increase.

Keywords: HDFS, Hadoop, MapReduce, Terasort, Teragen, Teravalidate, Name Node, Secondary Name Node, Data Node, Job Tracker, Task Tracker

1. Introduction

In today's technological environment, the amount of data generated is increasing at a very high rate. Computer engineers at the European Council for Nuclear Research (CERN) announced that the CERN Data Centre has recorded over 100 petabytes of physics data over the last 20 years; experiments at the Large Hadron Collider (LHC) generated about 75 petabytes of this data in the last three years. Storing it is a challenge [1]. With millions of users, Facebook is also collecting a great deal of data, and since everything is interesting in some context, it is useful to take advantage of this data. Storing this amount of data is difficult, but distributed storage systems provide an efficient scheme for managing it, and many such systems are currently available.

Data can be in one of three states: structured, semi-structured, and unstructured. The amount of structured data is very small compared to the semi-structured and unstructured data. The International Data Corporation (IDC) has published statistics on the amounts of structured and unstructured data in the present scenario and the predicted growth of structured and semi-structured data. Three V's were first proposed for Big Data: Volume, Velocity, and Variety [2]. Volume referred to the size of the files; but what happens when the number of files to be handled by the Hadoop Distributed File System increases?

A distributed system should have access transparency, location transparency, concurrency transparency, failure transparency, heterogeneity, scalability, replication transparency, and migration transparency; it should support fine-grained distribution of data and tolerance of network partitioning; and it should provide a distributed file system service for handling files.

The main aim of a distributed file system (DFS) is to allow users of physically distributed computers to share data and storage resources through a common file system. A DFS consists of multiple software components residing on multiple computers but running as a single system. The power of distributed computing can clearly be seen in some of the most ubiquitous modern applications. So, for the storage and better management of Big Data, we can use a distributed file system.

2. Related work

Big Data can be handled using Hadoop and MapReduce. Multi-node clusters can be set up with Hadoop and MapReduce, and files of different sizes can be stored in such a cluster [5].

3. Hadoop HDFS and MapReduce

Hadoop comes with its own default distributed file system, the Hadoop Distributed File System [3]. It stores files in blocks of 64 MB and can store files of varying size, from 100 MB up to gigabytes and terabytes. The Hadoop architecture contains the Name Node, Data Nodes, the Secondary Name Node, Task Trackers, and the Job Tracker. The Name Node maintains the metadata about the blocks stored in the Hadoop Distributed File System; files are stored as blocks in a distributed manner. The Secondary Name Node maintains the validity of the Name Node and updates the Name Node's information from time to time. The Data Nodes actually store the data.

Figure 1. HDFS architecture

The Job Tracker receives a job from the user, splits it into parts, and assigns these split jobs to Task Trackers. Task Trackers run on the Data Nodes; they fetch data from the Data Node, execute the task, and continuously talk to the Job Tracker, which coordinates the jobs submitted by users. A Task Tracker has a fixed number of slots for running tasks, and the Job Tracker selects a Task Tracker that has free slots available. It is useful to choose a Task Tracker on the same rack where the data is stored; this is known as rack awareness, and with it inter-rack bandwidth can be saved.

Figure 2 shows the arrangement of the different components of Hadoop on a single node. In this arrangement all of the components, the Name Node, Secondary Name Node, Data Node, Job Tracker, and Task Tracker, are on the same system. The user submits a job in the form of a MapReduce task. The Data Node and the Task Tracker are on the same system so that the best read and write speed can be achieved.

Figure 2. MapReduce framework
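The interaction just described, with clients obtaining block metadata from the Name Node while the file bytes themselves flow to and from the Data Nodes, can be illustrated with a small client program. The following is a minimal sketch, assuming a Hadoop 1.x installation whose Name Node address is configured in core-site.xml; the path /user/demo/sample.txt and the class name HdfsReadWrite are illustrative, not taken from the paper.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml
        FileSystem fs = FileSystem.get(conf);           // client handle; talks to the Name Node
        Path file = new Path("/user/demo/sample.txt");  // hypothetical HDFS path

        // Write: the client streams bytes to Data Nodes block by block;
        // the Name Node records only the block metadata.
        FSDataOutputStream out = fs.create(file);
        out.writeUTF("Student college workshop Student Lab");
        out.close();

        // Read: block locations come from the Name Node, bytes from the Data Nodes.
        FSDataInputStream in = fs.open(file);
        IOUtils.copyBytes(in, System.out, 4096, true);  // copies to stdout and closes the stream
    }
}

On Hadoop 1.x the 64 MB default block size comes from the dfs.block.size property and can be overridden when a file is created.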

4. Word count

WordCount is a simple MapReduce application. It counts the number of times each word occurs in a text file of an input set. It imports Hadoop's Path, conf.*, io.*, mapred.*, and util.* packages. The Mapper is implemented by a map method that processes one line at a time. The input can be strings or a text file. The map method splits its input into tokens, using the StringTokenizer class with whitespace as the separator, and emits output in the form <word, 1>. Different maps generate these key-value pairs; for example, consider an input file containing the text:

Student college workshop Student Lab

The Reducer for these maps is implemented by a reduce method. It takes input from all the maps and, as in a merge sort, combines the outputs of all the maps to produce output of the form <word, n>, where n is the number of occurrences of that word in the input. For the input above, the output of the reducer will be:

(Student, 2) (college, 1) (Lab, 1) (workshop, 1)

A code sketch of this job is given after Section 5.

5. Experimental setup

A Dell Precision T3500 system is used, with an Intel(R) Xeon(R) CPU W3565 @ 3.20 GHz, 4 GB RAM, 64-bit. The platform on it is Ubuntu 12.04, with an installation size of 30 GB; Ubuntu was installed using the Wubi installer. Hadoop-1.2.1-bin.tar is available for download from the Hadoop website, and Hadoop is installed on the Linux file system.
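The WordCount job described in Section 4 follows the standard pattern for the classic org.apache.hadoop.mapred API that the paper's import list suggests. The sketch below is a minimal version of that pattern under Hadoop 1.x, not the authors' exact code.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

    // Map: process one line at a time, split it on whitespace with
    // StringTokenizer, and emit <word, 1> for every token.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);   // emits <word, 1>
            }
        }
    }

    // Reduce: all counts for the same word arrive together;
    // summing them yields <word, n>.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

With Hadoop 1.2.1 installed as in Section 5, such a job is packaged into a jar and launched with a command of the form bin/hadoop jar wordcount.jar WordCount <input> <output>, where the input and output arguments are HDFS paths.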

Figure 3. Jps output

Figure 4. File bytes read by MapReduce task

6. Conclusion

In this paper we focused on the analysis of the word count application with respect to the MapReduce paradigm, considering a massive volume of large-scale datasets. We discussed the Hadoop Distributed File System and MapReduce for storing and processing large amounts of data, and we presented our experimental work on Hadoop, in which a number of files were applied as input to the system and the performance of the Hadoop system was then analyzed. This approach can be used for analyzing files generated as logs; it can also be used for analyzing sensor output, that is, the many files generated by different sensing devices.

References

[1] http://home.web.cern.ch/about/updates/2013/02/cern-data-centre-passes100-petabytes
[2] "Big Data has Big Potential to Improve Americans' Lives, Increase Economic Opportunities", Committee on Science, Space and Technology, April (2013). URL: http://science.house.gov/press-release
[3] K. Shvachko, H. Kuang, S. Radia and R. Chansler, "The Hadoop Distributed File System", Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium, 3-7 May (2010).
[4] A. P. Sivan, I. Johns and J. Venugopal, "Big Data Intelligence in Logistics Based on Hadoop and MapReduce", International Conference on Innovations in Engineering and Technology (ICIET'14), 21-22 March, India, (2014).
[5] X. Wang, C. Yang and J. Zhou, "Clustering aggregation by probability accumulation", Pattern Recognition, Vol. 42, No. 5, pp. 668-675, (2009).