
Volume 6, Issue 3, March 2016, ISSN: 2277 128X
International Journal of Advanced Research in Computer Science and Software Engineering
Research Paper, available online at: www.ijarcsse.com
Special Issue on 5th National Conference on Recent Trends in Information Technology 2016
Conference held at P.V.P. Siddhartha Institute of Technology, Kanuru, Vijayawada, India

Efficient Big Data Processing in Hadoop MapReduce

Kvn Krishna Mohan, K Prem Sai Reddy, K Geetha Sri, A Prabhu Deva, M. Sundarababu (Asst. Professor)
Department of IT, PVP Siddhartha Institute of Technology, Andhra Pradesh, India

Abstract: In this work, we describe how big data can be processed efficiently in Hadoop using the MapReduce technique. Big data involves concerns such as analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, and information privacy. The world's content keeps growing, and everyone faces the challenges of big data. Hadoop is an open-source software framework, written in Java, used for processing vast amounts of unstructured data. Hadoop, which runs on Linux machines, supports clusters of up to 10,000 cores. The Hadoop framework splits data into blocks and distributes them among the cluster nodes, and MapReduce processes the data set with a parallel, distributed algorithm on the cluster.

Keywords: Hadoop, MapReduce, K-Means, Data Analysis, Storage, Clusters.

I. INTRODUCTION
Hadoop is an open-source framework used to store large amounts of data in a distributed environment. It is a Java-based programming framework used for processing data across different clusters, and it is designed to scale up from single servers to thousands of machines. Hadoop performs two roles: transforming data and acting as a repository for it. Hadoop is flexible, economical, and scalable, and it provides fault tolerance. MapReduce is the technique used for splitting and processing these large amounts of data.
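The MapReduce technique just mentioned can be illustrated with a toy, single-machine sketch in plain Python. This shows only the programming model (map, shuffle, and reduce over <key, value> pairs), not Hadoop itself, and the word-count example is our own illustration:

```python
from collections import defaultdict

def mapper(record):
    # Emit a <key, value> pair for every word in the input chunk.
    for word in record.split():
        yield (word, 1)

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Condense the grouped values into the small required output.
    return (key, sum(values))

chunks = ["big data big clusters", "big data"]
pairs = [pair for chunk in chunks for pair in mapper(chunk)]
result = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'big': 3, 'data': 2, 'clusters': 1}
```

In real Hadoop the shuffle is performed by the framework between the map and reduce phases, and each phase runs in parallel across the cluster nodes.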
MapReduce consists of two tasks, Map and Reduce. A mapper turns the given input data into small chunks; after these small pieces of data are generated, they are shuffled, and the output coming from the mappers is reduced and stored in the database. The MapReduce framework operates on <key, value> pairs.

K-Means is a clustering technique that gives an easy way to assign data points to a group of clusters, partitioning the data into a small number of clusters. The technique is fast, robust, and easy to understand.

II. RELATED WORK
The history of clustering is long and contains a wide variety of studies. Performed in a parallel fashion with MapReduce, clustering on Hadoop improves data accuracy and reliability compared with other software. Consider Facebook, one of the largest tech giants in the world: it has mostly unstructured data to process, and millions of records must be retrieved within a few seconds. Unless data retrieval takes only seconds, nobody spares the time to use the website. Over the past years there has been a vast increase in data usage, and to stand first in this competing world, everyone must come up with a solution. Doug Cutting and Mike Cafarella, inspired by the white paper published by Google, set out to create an open-source software framework to support distribution for the Nutch search engine project. These steps led to the main answer to big data analytics problems. Many IT companies have faced huge downfalls through inconsistent maintenance of data; Hadoop makes it easier to process and store bulk amounts of data.

III. THEORETICAL ANALYSIS: K-MEANS AND MAPREDUCE
Clustering is a function that groups a set of objects so that similar objects are combined into a single cluster. The main purpose of this clustering is to turn unstructured data objects into structured data objects.
Dissimilar objects form other clusters. Clustering should deal with noise and outliers, and it must handle different types of attributes. Clustering is used in many kinds of fields, such as marketing and biology.

The k-means algorithm is used to partition a data set into clusters: it clusters n objects, based on their attributes, into k partitions, where k < n and k is a positive integer. It assumes that the object attributes form a vector space. The K-Means algorithm works as follows:
1. Begin with a decision on the value of k, the number of clusters.
2. Form an initial partition that classifies the data into k clusters.
3. Take each sample in sequence and compute its distance from the centroid of each cluster.
4. If a sample is not in the cluster whose centroid is closest, switch it to that cluster and update the centroids of both clusters.
5. Repeat the process until convergence is achieved.

A few practical notes: use MapReduce only if you have enormous data, use a lot of defensive checks, and remember that testing can save a lot of time. The MapReduce technique is best when processing time must be kept low. MapReduce is a functional programming model for analyzing one record at a time: the map processes the given input data, the intermediate data is shuffled, and the reduce condenses it into the small required data that is stored in the database.

Algorithm 1: k-means++
Input: k, the number of clusters; X = {x1, x2, ..., xn}, a set of data points.
Output: C = {c1, c2, ..., ck}.
1. C ← ∅
2. Choose one center x uniformly at random from X; C = C ∪ {x}
3. repeat
4.     Choose x ∈ X with probability D(x)² / Σ D(x)²
5.     C = C ∪ {x}
6. until k centers are chosen
7. Proceed as with the standard k-means algorithm

Algorithm 2: Mapper phase of k-means++ initialization
Input: k, the number of clusters; X = {x1, x2, ..., xn}, a set of data points.
Output: (num[i], ci), i = 1, 2, ..., k, where num[i] denotes the number of points that center i represents.
1. C ← ∅
2. Choose one center x uniformly at random from X; C = C ∪ {x}
3. for i = 1 to k do num[i] = 0
4. while |C| < k do
5.     Compute D²(x) between each x ∈ X and its nearest center already chosen
6.     Choose x with probability D²(x) / Σ D²(x)
7.     C ← C ∪ {x}
8. for i = 1 to n do
9.     Find the nearest center cj ∈ C for xi
10.    num[j]++
11. return (num[i], ci)

Algorithm 3: Reducer phase of k-means++ initialization
Input: k, the number of clusters; X, the set of (num, c) pairs.
Output: C = {c1, c2, ..., ck}.
1. C ← ∅
2. Choose one center x uniformly at random from X; C = C ∪ {x}
3. while |C| < k do
4.     Compute D²(x) between each x ∈ X and its nearest center already chosen
5.     Choose x with probability num(x) · D²(x) / Σ num · D²(x)
6.     C ← C ∪ {x}
7. return C
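The mapper and reducer phases of the k-means++ initialization above can be simulated on a single machine in Python. This is only an illustrative sketch: the points are one-dimensional, and the function names and sample data are our own, not from the paper's implementation:

```python
import random

def nearest_sq_dist(x, centers):
    # D^2(x): squared distance from x to its nearest already-chosen center.
    return min((x - c) ** 2 for c in centers)

def weighted_pick(candidates, weights):
    # Choose one candidate with probability proportional to its weight.
    return random.choices(candidates, weights=weights, k=1)[0]

def kmeanspp_mapper(points, k):
    # Algorithm 2: choose k centers on this mapper's split, then count
    # how many local points each center represents (num[i]).
    centers = [random.choice(points)]
    while len(centers) < k:
        weights = [nearest_sq_dist(x, centers) for x in points]
        centers.append(weighted_pick(points, weights))
    num = [0] * k
    for x in points:
        nearest = min(range(k), key=lambda i: (x - centers[i]) ** 2)
        num[nearest] += 1
    return list(zip(num, centers))

def kmeanspp_reducer(weighted_centers, k):
    # Algorithm 3: re-select k global centers from the mappers' candidates,
    # weighting each candidate by num * D^2(x).
    nums, cands = zip(*weighted_centers)
    centers = [random.choice(cands)]
    while len(centers) < k:
        weights = [n * nearest_sq_dist(c, centers)
                   for n, c in zip(nums, cands)]
        if sum(weights) == 0:
            break  # every remaining candidate coincides with a chosen center
        centers.append(weighted_pick(cands, weights))
    return centers

random.seed(0)
points = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
partial = kmeanspp_mapper(points[:3], 2) + kmeanspp_mapper(points[3:], 2)
print(kmeanspp_reducer(partial, 2))  # two centers drawn from the data points
```

The design mirrors the pseudocode: each mapper works only on its own split, and the reducer repeats the D²-weighted selection over the mappers' candidate centers, scaled by how many points each candidate represents.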

IV. DATA ANALYSIS
A medical data set is analyzed using the MapReduce method together with K-Means, one of the best techniques in data mining. Data mining is straightforward, and analyzing data with it is easier than with neural networks or artificial intelligence. Data mining offers several functionalities, i.e., clustering, classification, regression or prediction, association, etc. Clustering is used here to reduce the large amounts of data to the required quantities of evidence. Finally, the clustering method is applied to the different types of hospital data, which consist of large data sets. We chose the K-means algorithm because its speed on large data sets is incomparable to that of other techniques. The different patients' diseases are considered and evaluated into different clusters, where a single cluster consists of patients with the same disease. The output is either unformatted or in naming format; since it is not structured, we used the Gephi tool to get the graph structure. The evaluated output is very accurate. To assess the data we used two different clusters, and the two clusters are processed using the MapReduce method to get the desired outcome.

V. EXPERIMENTAL RESULTS
After the analysis of the data, Hadoop is installed on a Linux machine and the sample data is inserted into the repository. Later, the MapReduce code is executed along with an initialization cluster. All the files can be browsed at the default URL http://localhost:50070/. Running the MapReduce job requires specifying all the paths, i.e., the input directory, the output directory, and the cluster-initialization directory, along with the algorithm used. Hadoop provides reliable output in mapping and analyzing the data, and this data can be employed for further representation.

Figure 1: Running Hadoop on localhost
Figure 2: Browse to the user directory
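The two-cluster evaluation described in the data analysis can be made concrete with a minimal standard k-means run (k = 2) in Python. The patient readings below are made-up values for illustration, not the study's medical data set:

```python
def kmeans(points, centers, iters=10):
    # Standard k-means: assign each point to its nearest center,
    # then recompute each center as the mean of its cluster.
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in points:
            j = min(range(len(centers)), key=lambda i: (x - centers[i]) ** 2)
            clusters[j].append(x)
        # Drop any cluster that ended up empty before averaging.
        centers = [sum(c) / len(c) for c in clusters if c]
    return centers, clusters

# Hypothetical patient glucose readings: two disease groups are visible.
readings = [82, 85, 88, 90, 160, 165, 170, 158]
centers, clusters = kmeans(readings, centers=[82, 160])
print(sorted(clusters[0]), sorted(clusters[1]))
# -> [82, 85, 88, 90] [158, 160, 165, 170]
```

Each resulting cluster holds records with similar values, mirroring how the study groups same-disease patients into a single cluster.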

Figure 3: All the inserted data is present in the user directory
Figure 4: The output folder consists of two chunks of data
Figure 5: Output data

VI. CONCLUSION
This paper has discussed the importance of Hadoop and its different techniques for big data. Though much scalable software is available on the market, Hadoop is preferred by more than half of the Fortune 50 companies, and there has been considerable growth in its usage over the last decade. Almost all the leading companies, such as Yahoo, Google, Amazon, Facebook, IBM, and EMC, prefer Hadoop over other software, while many companies are giving up because they cannot adapt to big data techniques. The ability to process unstructured data is a key feature of Hadoop, and it is the main reason for preferring it over the alternatives.

REFERENCES
[1] Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), 137-150 (2004)
[2] Lämmel, R.: Google's MapReduce Programming Model Revisited. Science of Computer Programming, vol. 68, issue 3, 208-237 (2007)
[3] White, T.: Hadoop: The Definitive Guide, Second Edition. Yahoo Press (2009)
[4] Lin, J., Dyer, C.: Data-Intensive Text Processing with MapReduce (Synthesis Lectures on Human Language Technologies). Morgan & Claypool Publishers (2010)
[5] Afrati, F. N., Ullman, J. D.: Optimizing Joins in a Map-Reduce Environment. 13th International Conference on Extending Database Technology, 99-110 (2010)
[6] Apache Hadoop, http://hadoop.apache.org/
[7] MapReduce Design of K-Means Clustering Algorithm. International Conference on Multimedia Communications, 978-1-4799-0604-8/13, IEEE (2013)