Performance Optimization of Big Data Processing using Clustering Technique in Map Reduces Programming Model


Ravindra Singh Raghuwanshi
Samrat Ashok Technological Institute, Vidisha, M.P., India

Deepak Sain
Samrat Ashok Technological Institute, Vidisha, M.P., India

ABSTRACT

Each new generation of technology has to meet the demands of the growing digital universe. Digital data grows day by day, from megabytes to petabytes, and this rate of growth calls for a new generation of technology such as big data processing. This paper optimizes the performance of the MapReduce programming model to improve data processing. The modified programming model uses a clustering technique that organizes the map data into task groups. Each task group of map data is correlated with a different data index for processing on the data nodes. The proposed model is implemented in the Hadoop framework and programmed in Java. Performance is evaluated on three standard datasets by measuring the processing time and the count value of the file.

Keywords: Big Data, Hadoop, MapReduce, Clustering, Optimization

1. INTRODUCTION

The increasing volume of digital data creates problems of processing, speed and analysis. Conventional file systems and frameworks cannot support the NoSQL concept, which handles unstructured and unformatted data for analysis and processing. Big data is processed with MapReduce functions, which are based on the Java programming model [1,2]. The Java programming model generates the key values of the tasks used to process the data. This paper proposes a prototype model for data processing that is based on data mining. Data mining provides various algorithms for processing data, including association rule mining for relational data [3,4].

Here, a clustering technique from data mining is used to improve MapReduce file processing in the Hadoop data analysis tool. The modified MapReduce model is simulated in the Hadoop framework. The MapReduce function operates on two data attributes: the cluster of data and the key value. The key value drives the processing of a group of data for analysis [11]. The modified MapReduce function component encapsulates the DB scale clustering technique, which defines the range of values of each data group for processing the blocks created by the map phase. More than 3% improvement has been observed in some applications, which is quite impressive from a computational perspective. It has also been observed that the clustering time becomes almost stationary for higher numbers of nodes, even when the input volume of data is increased from 7 million to 1 million [12]. Thus DB scale is a very useful clustering technique; using it in a cloud environment for processing big data has some inherent advantages, and it may be used for various applications.

Many tools are available for processing and analyzing datasets, and the most popular and widely used is Apache Hadoop [7]. Hadoop can handle all types of data, such as structured data, unstructured data, log files and pictures. Hadoop supports redundancy, scalability, parallel processing and a distributed architecture [7]. In general, distributed computing [8] is a field of computer science that involves multiple computers located remotely from each other. Each computer has a shared role in a computation problem, and the computers coordinate their actions by message passing. The scheduling problem is also faced in other computing systems; the work in [9] addresses scheduling in a general-purpose distributed computing environment.

The rest of this paper is organized as follows. Section 2 discusses the MapReduce programming model, Section 3 presents the proposed algorithm, Section 4 analyzes the experimental results, and Section 5 concludes.

2. MAPREDUCE PROGRAMMING MODEL

MapReduce is a programming model that processes large amounts of data in parallel on large clusters of machines [14]. A MapReduce program mainly consists of two functions, a map function and a reduce function, as described in Fig. 1. The map function takes a value as input and generates key:value pairs. Once every value has a key, the programming model groups the values together according to their keys. This is the job of the combiner module. The output of the combiner module becomes the input of the reduce function. The reduce function takes a key and a list of values as input, processes them, and generates output as required. Users define the map and reduce functions, and the runtime architecture automatically distributes the tasks, takes care of machine failures, handles communication among the nodes, and balances the load among them.

Hadoop provides a MapReduce runtime system along with a distributed file system that offers high fault tolerance and scalability. The Hadoop distributed file system replicates data across the nodes, which increases the availability of data. The file system uses TCP/IP for communication. There are five kinds of server in Hadoop, as shown in Fig. 4.2. 2. The name node, data node and secondary name node handle data storage, retrieval and fault tolerance. The job tracker and task tracker handle the MapReduce computation.
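The map, group-by-key and reduce flow described in this section can be illustrated with a small single-process sketch. This is plain Python, not the Hadoop API and not the authors' Java program; the function names (`map_fn`, `group_by_key`, `reduce_fn`, `run_job`) and the word-count task are assumptions chosen only for illustration.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def group_by_key(pairs):
    """Group all values that share a key (the step between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_fn(key, values):
    """Reduce: collapse the list of values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, grouping and reduce sequentially over a list of lines."""
    intermediate = []
    for line in lines:
        intermediate.extend(map_fn(line))
    grouped = group_by_key(intermediate)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical two-line input, used only to exercise the pipeline.
    print(run_job(["big data big cluster", "data node"]))
```

In a real Hadoop job the three stages run on different machines and the framework moves the intermediate pairs between them; here everything runs in one process so that the data flow is easy to follow.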

Figure 1: Processing of a MapReduce data segment for analysis.

3. PROPOSED ALGORITHM

The proposed model is described in two parts: the grouping of data for the clustering task, and the reduce process for the Hadoop framework. This section describes the clustering process of the MapReduce framework for big data analysis. The proposed MapReduce-based data analysis system consists of three important functions: Map, Intermediate system and Reduce. The overall operation of the proposed architecture is given by

$$DS \rightarrow M \rightarrow IMS \rightarrow R \rightarrow \text{Final value} \tag{1}$$

where DS is the dataset, M is the mapper, IMS is the intermediate system and R is the reducer.

MAPPER OPERATION

A big data dataset DS is first divided into a number of subsets. Each subset contains many attributes.

$$DS_i = DS_1 + DS_2 + \dots + DS_n, \quad 0 < i < m \tag{2}$$

where $DS_1$, $DS_2$ and $DS_n$ are the subsets.

Normally, the map function is written by the user; it takes an input pair and generates a set of intermediate key/value pairs. In the MapReduce architecture of Figure 2, a map operation is associated with each data item. The first step is to partition the input dataset, typically stored in a distributed file system, among the computers that execute the map functionality. From a logical perspective, all data is treated as a Key (K), Value (V) pair. Each attribute in the input dataset is represented as a <key1, value1> pair. In the second step, each mapper applies the map function to each single attribute to generate a list of the form (<key2, value2>), where () represents lists of length zero or more.

$$\mathrm{Map}: \langle key1, value1 \rangle \rightarrow (\langle key2, value2 \rangle) \tag{3}$$

In this context, the mapper first initializes the necessary structures, primarily the input key and value. For this purpose, we use the firefly algorithm and a naïve Bayes classifier. The firefly-based feature selection process is as follows. First, a modified dataset is developed from the training dataset for fitness selection. The modified dataset contains only the identified attributes (1s). It is created based on the initial firefly.
This modified dataset is then classified using the naïve Bayes classifier, from which we obtain the mean and variance:

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{4}$$

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 \tag{5}$$

where $x_i$ is the i-th attribute and $n$ is the number of attributes.

3.1 Reducer Operation

The third step is to shuffle the output of the mappers into the systems that execute the reduce functionality. A reduce operation takes all values represented by the same key in the intermediate list and processes them accordingly, emitting a final new list. Here, once the best feature space has been identified through the firefly algorithm, the big data analysis is done using the naïve Bayes classifier. The outputs of all map nodes, the <key1, value1> entries, are grouped by their key1 values before being distributed to the reduce operation. The reduce operation then combines the value1 values that belong to a specific key1. The product of the reduce operation may be a list, <value2>, or just a single value, value2.

$$DS(\langle key1, value1 \rangle) \rightarrow M(\langle key2, value2 \rangle) \rightarrow R(\langle key2, value2 \rangle) \rightarrow value2 \tag{7}$$

Analysis using DB Scale

Each incoming input record is validated by tokenizing its attributes and using the pre-calculated attribute probability of each feature to classify the incoming value as reduced output data with the following naïve Bayes expressions. First, the mean and variance are calculated with equations (4) and (5) using the naïve Bayes classifier; the analysis then proceeds as follows. For the analysis as data, the posterior is

$$\mathrm{posterior}(RO \mid f_1, f_2, \dots, f_i) = \frac{P(RO) \prod_i P(f_i \mid RO)}{\mathrm{evidence}} \tag{8}$$

For the analysis as used data, the posterior is given by

$$\mathrm{posterior}(RO_O \mid f_1, f_2, \dots, f_i) = \frac{P(RO_O) \prod_i P(f_i \mid RO_O)}{\mathrm{evidence}} \tag{9}$$

where

$$\mathrm{evidence} = P(RO) \prod_i P(f_i \mid RO) + P(RO_O) \prod_i P(f_i \mid RO_O)$$

The evidence (also termed the normalizing constant) can be calculated because the sum of the posterior probabilities must equal one.

Figure 2: Processing of the MapReduce file system using the DB scale clustering technique.

Figure 3: Processing of MapReduce for the generation of data based on cluster nodes.

4. EXPERIMENTAL ANALYSIS

The proposed algorithm is implemented in Hadoop, an open-source, Linux-based framework. Hadoop runs the MapReduce functions for data analysis, and the MapReduce programs are compiled with the Java JDK. The proposed model is evaluated with a count program run on different datasets.

tweet 15 51 27 32 11 6 32 335 135 45 42 1245 47 37

Table 2: Performance evaluation of number of, number of count, and input using the rt-tweet dataset.

count 15 45 32 345 11 47 45 135 65 37 38 1245 5 42 1175 51 27 32

Table 3: Performance evaluation of number of, number of count, and input using the tpcds-setup dataset.

Table 1: Performance evaluation of number of, number of count, and input using the batch-tweet dataset.

count Batch- 1 5 35 count 15 47 28 31 1 48 24 26 15 5 36 455 11 551 355 135 6 37 39

Table 4: Performance evaluation of number of, number of count, and input using the zipcode-setup dataset.

count 1 51 38 85 27 325 1 6 32 39 1135 45 435 1245 47 37
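The mean, variance and posterior computations of Section 3 (Eqs. 4, 5, 8 and 9) can be sketched for two classes as follows. This is a minimal illustration, not the authors' implementation: the Gaussian likelihood for P(f_i | class) is an assumption (the paper only states that a mean and variance are computed), and the class labels "RO" and "RO_O", the helper names and the toy data are hypothetical.

```python
import math


def mean(xs):
    """Eq. (4): arithmetic mean of the values."""
    return sum(xs) / len(xs)


def variance(xs):
    """Eq. (5): population variance (divides by n, not n - 1)."""
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def gaussian_pdf(x, mu, var):
    """Assumed likelihood P(f_i | class): a normal density (requires var > 0)."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit(samples_by_class):
    """Per class, store the prior and one (mean, variance) pair per feature."""
    total = sum(len(rows) for rows in samples_by_class.values())
    model = {}
    for label, rows in samples_by_class.items():
        columns = list(zip(*rows))  # transpose rows into feature columns
        stats = [(mean(col), variance(col)) for col in columns]
        model[label] = (len(rows) / total, stats)
    return model


def posteriors(model, features):
    """Eqs. (8)-(9): prior times feature likelihoods, divided by the evidence."""
    scores = {}
    for label, (prior, stats) in model.items():
        score = prior
        for x, (mu, var) in zip(features, stats):
            score *= gaussian_pdf(x, mu, var)
        scores[label] = score
    evidence = sum(scores.values())  # normalizing constant
    return {label: score / evidence for label, score in scores.items()}


# Hypothetical toy training data: two features per record, two classes.
training = {
    "RO": [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2)],
    "RO_O": [(3.0, 4.0), (3.2, 4.4), (2.8, 3.6)],
}

if __name__ == "__main__":
    print(posteriors(fit(training), (1.1, 2.1)))
```

Because each posterior is divided by the same evidence term, the two posteriors sum to one, which is the property the text uses to justify computing the evidence.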

Figure 4: Comparative performance evaluation graph for the batch-tweet dataset.

Figure 5: Comparative performance evaluation graph for the rt-tweet dataset.

Figure 6: Comparative performance evaluation graph for the tpcds-setup dataset.

5. CONCLUSION & FUTURE SCOPE

This paper modified the MapReduce programming model using the DB scale clustering technique. The modified MapReduce model is simulated in the Hadoop framework. The MapReduce function operates on two data attributes: the cluster of data and the key value. The key value drives the processing of a group of data for analysis. The modified MapReduce function component encapsulates the DB scale clustering technique, which defines the range of values of each data group for processing the blocks created by the map phase. More than 3% improvement has been observed in some applications, which is quite impressive from a computational perspective. It has also been observed that the clustering time becomes almost stationary for higher numbers of nodes, even when the input volume of data is increased from 7 million to 1 million. Thus DB scale is a very useful clustering technique; using it in a cloud environment for processing big data has some inherent advantages, and it may be used for various applications.

The DB scale clustering technique used to modify the MapReduce programming model performs very well on limited data. However, when the data volume reaches the petabyte scale, the grouping rules and policies for creating data nodes suffer. In future, a time-based optimization technique will be used for processing petabyte-scale data.

6. REFERENCES

[1] Carson Kai-Sang Leung, Richard Kyle MacKinnon and Fan Jiang, "Reducing the Search Space for Big Data Mining for Interesting Patterns from Uncertain Data", IEEE, 2014, Pp 315-322.
[2] Rama Satish K. V. and Dr. N. P. Kavya, "Big Data Processing with harnessing Hadoop - MapReduce for Optimizing Analytical Workloads", IEEE, 2014, Pp 49-54.
[3] Seungwoo Jeon, Bonghee Hong and Byungsoo Kim, "Big Data Processing for Prediction of Traffic Time based on Vertical Data Arrangement", IEEE, 2014, Pp 327-333.
[4] Rajiv Ranjan, "Modeling and Simulation in Performance Optimization of Big Data Processing Frameworks", IEEE, 2014, Pp 14-19.
[5] Muhammad Mazhar Ullah Rathore, Anand Paul, Awais Ahmad, Bo-Wei Chen, Bormin Huang and Wen Ji, "Real-Time Big Data Analytical Architecture for Remote Sensing Application", IEEE, 2015, Pp 1-12.
[6] Jyoti V Gautam, Harshadkumar B Prajapati, Vipul K Dabhi and Sanjay Chaudhary, "A Survey on Job Scheduling Algorithms in Big Data Processing", IEEE, 2015, Pp 1-11.
[7] Alfred Daniel, Anand Paul and Awais Ahmad, "Near Real-Time Big Data Analysis on Vehicular Networks", International Conference on Soft-Computing and Network Security, 2015, Pp 1-7.
[8] Chun-Wei Tsai, Chin-Feng Lai, Ming-Chao Chiang and Laurence T. Yang, "Data Mining for Internet of Things: A Survey", IEEE, 2014, Pp 77-97.
[9] Albert Bifet, "Mining Big Data in Real Time", Informatica, 2013, Pp 15-2.
[10] Gwangbum Pyun, Unil Yun and Keun Ho Ryu, "Efficient frequent pattern mining based on Linear Prefix tree", Elsevier, 2013, Pp 125-139.
[11] Unil Yun and Keun Ho Ryu, "Approximate weighted frequent pattern mining with/without noisy environments", Elsevier, 2010, Pp 73-82.
[12] Zhi-Hua Zhou, Nitesh V. Chawla, Yaochu Jin and Graham J. Williams, "Big Data Opportunities and Challenges: Discussions from Data Analytics Perspectives", IEEE, 2011, Pp 1-2.
[13] Boris Novikov, Natalia Vassilieva and Anna Yarygina, "Querying Big Data", International Conference on Computer Systems and Technologies, 2012, Pp 1-1.
[14] Liwen Sun, Reynold Cheng, David W. Cheung and Jiefeng Cheng, "Mining Uncertain Data with Probabilistic Guarantees", ACM, 2010, Pp 273-282.
[15] Yuxuan Li, James Bailey, Lars Kulik and Jian Pei, "Mining Probabilistic Frequent Spatio-Temporal Sequential Patterns with Gap Constraints from Uncertain Databases", Pp 1-1.
[16] Carson Kai-Sang Leung and Fan Jiang, "Frequent Itemset Mining of Uncertain Data Streams Using the Damped Window Model", ACM, 2011, Pp 95-955.

IJCA TM : www.ijcaonline.org