Performance Optimization of Big Data Processing using Clustering Technique in Map Reduces Programming Model

Ravindra Singh Raghuwanshi
Samrat Ashok Technological Institute, Vidisha, M.P., India

Deepak Sain
Samrat Ashok Technological Institute, Vidisha, M.P., India

ABSTRACT
Each new generation of technology must meet the demands of digital-universe data. Day by day, digital-universe data explodes, growing from megabytes to petabytes. This rate of growth demands a new generation of technology such as big data processing. This paper optimizes the performance of the MapReduce programming model to enhance data processing. The modified programming model uses a clustering technique, which incorporates the processing of map data in terms of task groups. Each task group of map data is correlated with a different data index for processing on the data nodes. The proposed model is implemented in the Hadoop framework and programmed in Java. Performance is evaluated on three standard datasets by measuring the processing time and the file count value.

Keywords
Big Data, Hadoop, MapReduce, Clustering, Optimization

1. INTRODUCTION
The increasing volume of digital data poses problems of processing, speed and analysis. Conventional file systems and frameworks cannot support the concept of NoSQL, which handles unstructured and unformatted data for analysis and processing. Big data is processed with map and reduce functions, which are basically built on the Java programming model [1,2]. The Java programming model generates the key values of tasks for the processing of data. This paper proposes a prototype model for the processing of data. The prototype model is based on the concept of data mining, which provides various algorithms for processing data. Data mining provides association rule mining techniques for the processing of relational data [3,4].
Here, a clustering technique from data mining is used to improve MapReduce file processing in the Hadoop data analysis tool. The modified MapReduce model is simulated in the Hadoop framework. The processing of the MapReduce function is based on two data attributes: one is the cluster of data and the other is the key value. The key value drives the processing of a group of data for analysis [11]. The modified MapReduce function component encapsulates the processing of the DB scale clustering technique, which defines the range of values of the grouped data for the processing of map-created blocks. More than 3% improvement has been observed in some applications, which is quite impressive from a computational perspective. It has also been observed that the clustering time becomes almost stationary for a higher number of nodes even when the input volume of data is increased from 7 million to 1 million [12]. Thus, DB scale is a very useful clustering technique; using it in a cloud environment for processing big data has some inherent advantages, and it may be used for various applications.

Many tools are available for processing and analysing datasets; the most popular and widely used is Apache Hadoop [7]. Hadoop can handle all types of data, such as structured data, unstructured data, log files and pictures. Hadoop supports redundancy, scalability, parallel processing and a distributed architecture [7]. In general, distributed computing [8] is a field of computer science that involves multiple computers located remotely from each other. Each computer has a common shared role in a computation problem, and the computers coordinate their actions by message passing. The scheduling problem is also faced in other computing systems. The work in [9] addresses scheduling in a general-purpose distributed computing environment.

The rest of this paper is organized as follows. Section 2 discusses the MapReduce programming model, Section 3 presents the proposed algorithm, Section 4 gives the experimental result analysis, and Section 5 concludes the paper.

2.
MAPREDUCE PROGRAMMING MODEL
MapReduce is a programming model that processes large amounts of data in parallel on large clusters of machines [14]. A MapReduce program mainly consists of two functions, a map function and a reduce function, as described in Fig. 1. The map function takes a value as input and generates key:value pairs. Once all the values have a key, the programming model groups the values together according to their keys. This is the job of the combiner module. The output of the combiner module becomes the input of the reduce function. The reduce function takes a key and a list of values as input, processes them and generates output as required. Users define the map and reduce functions, and the runtime architecture automatically distributes the tasks, takes care of machine failures, handles communication among the different nodes and balances the load among them.

Hadoop provides a MapReduce runtime system along with a distributed file system that offers high fault tolerance and scalability. The Hadoop distributed file system replicates the data across the nodes, which increases the availability of data. The file system uses TCP/IP for communication. There are five kinds of server in Hadoop, as shown in Fig. 4.2. The name node, data node and secondary name node handle data storage, retrieval and fault tolerance. The job tracker and task tracker handle the MapReduce computation.
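The map, group and reduce steps described above can be illustrated with a small in-memory count program. This is a single-process sketch in Python, not Hadoop code; the names `map_fn`, `group_by_key`, `reduce_fn` and `run_job` and the sample lines are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(line: str) -> Iterator[Tuple[str, int]]:
    """Map: take one input value (a line) and emit (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def group_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group: collect all values that share the same key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_fn(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: take a key and its list of values and emit one result."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, group and reduce in sequence on a single machine."""
    pairs = [pair for line in lines for pair in map_fn(line)]
    grouped = group_by_key(pairs)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


# Hypothetical input lines standing in for records of a dataset.
counts = run_job(["big data hadoop", "hadoop map reduce", "big data"])
print(counts)
```

In a real cluster the grouping step runs between machines and the map and reduce calls run in parallel; the sketch keeps only the data flow.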
Figure 1: Processing of MapReduce data segments for analysis.

3. PROPOSED ALGORITHM
The proposed model is described in two parts: one is the grouping of data for the clustering task and the other is the reduce process for the Hadoop framework. This section describes the clustering process of the MapReduce framework for big data analysis. The proposed MapReduce-based data analysis system consists of three important functions: Map, Intermediate system and Reduce. The overall operation of the proposed architecture is given by

$$DS \rightarrow M \rightarrow IMS \rightarrow R \rightarrow \text{Final value} \tag{1}$$

where DS is the dataset, M is the mapper, IMS is the intermediate system and R is the reducer.

MAPPER OPERATION
A big data dataset DS is first divided into a number of subsets, each containing many attributes:

$$DS_i = DS_1 + DS_2 + \dots + DS_n, \quad 0 < i < m \tag{2}$$

where $DS_1$, $DS_2$ and $DS_n$ are the subsets.

Normally, map is written by the user; it takes an input pair and generates a set of intermediate key/value pairs. In the MapReduce architecture of Figure 2, a map operation is associated with each data item. The first step is to partition the input dataset, typically stored in a distributed file system, among the computers that execute the map functionality. From the logical perspective, all data is treated as a key (K), value (V) pair. Each attribute in the input dataset is represented as <key1, value1>. In the second step, each mapper applies the map function to each single attribute to generate a list of the form (<key2, value2>), where () represents lists of length zero or more:

$$\text{Map}: \langle key1, value1 \rangle \rightarrow (\langle key2, value2 \rangle) \tag{3}$$

In this context, the mapper first initializes the necessary structures, primarily the input key and value. For this purpose, we have utilized the firefly algorithm and the naive Bayes classifier. The firefly-based feature selection process is as follows. First, we develop a modified dataset from the training dataset for fitness selection. The modified dataset contains only the identified attributes (1s) and is created based on the initial firefly.
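The construction of the modified dataset described above can be sketched as follows. This is a minimal illustration, assuming that a firefly is encoded as a binary mask with one bit per attribute (1 = attribute identified, 0 = attribute dropped); the function name `select_attributes`, the sample rows and the mask are hypothetical and do not come from the paper's implementation.

```python
from typing import List, Sequence


def select_attributes(rows: Sequence[Sequence[float]],
                      mask: Sequence[int]) -> List[List[float]]:
    """Keep only the columns whose mask bit is 1 (assumed firefly encoding)."""
    if any(len(row) != len(mask) for row in rows):
        raise ValueError("every row must have one value per mask bit")
    keep = [j for j, bit in enumerate(mask) if bit == 1]
    return [[row[j] for j in keep] for row in rows]


# Hypothetical training rows with four attributes each.
training = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
]
# Hypothetical initial firefly: attributes 0 and 2 are identified.
firefly = [1, 0, 1, 0]

modified = select_attributes(training, firefly)
print(modified)
```

A fitness function would then score each such modified dataset; how the fireflies are updated between iterations is not specified here and is left out of the sketch.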
This modified dataset is then classified using the naive Bayes classifier, for which we obtain the mean and variance:

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{4}$$

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 \tag{5}$$

where $x_i$ is the i-th attribute and $n$ is the number of attributes.

3.1 Reducer Operation
The third step is to shuffle the output of the mappers to the systems that execute the reduce functionality. A reduce operation takes all values represented by the same key in the intermediate list and processes them accordingly, emitting a final new list. Here, once the best feature space is identified through the firefly algorithm, the big data analysis is done using the naive Bayes classifier. The output from all map nodes, the <key1> and <value1> entries, is grouped by key1 values before being distributed to the reduce operation. It is then the turn of the reduce operation to combine the value1 values according to a specific key1. The product of the reduce operation may be in the format of a list, <value2>, or just a single value, value2.

$$DS(\langle key1, value1 \rangle) \rightarrow M \rightarrow (\langle key2, value2 \rangle) \rightarrow R((\langle key2, value2 \rangle)) \rightarrow value2 \tag{7}$$

Analysis using DB Scale
Validation of each incoming input datum is attained by tokenizing the attribute and using the pre-calculated attribute probability of each feature to classify the incoming value as reduced output data using the following naive Bayes expression. First, the mean and variance of equations (4) and (5) are calculated using the naive Bayes classifier, and the analysis then proceeds as follows.

For the analysis as data, the posterior is given by

$$\text{posterior}(RO \mid f_1, f_2, \dots, f_i) = \frac{P(RO) \prod_{j=1}^{i} P(f_j \mid RO)}{\text{evidence}} \tag{8}$$

For the analysis as used data, the posterior is given by

$$\text{posterior}(RO_O \mid f_1, f_2, \dots, f_i) = \frac{P(RO_O) \prod_{j=1}^{i} P(f_j \mid RO_O)}{\text{evidence}} \tag{9}$$

where

$$\text{evidence} = P(RO) \prod_{j=1}^{i} P(f_j \mid RO) + P(RO_O) \prod_{j=1}^{i} P(f_j \mid RO_O)$$

The evidence (also termed the normalizing constant) may be calculated since the sum of the posterior probabilities must equal one.

tweet 15 51 27 32 11 6 32 335 135 45 42 1245 47 37

Table 2: Performance evaluation of number of, number of count, and input using the rt-tweet dataset.
count 15 45 32 345 11 47 45 135 65 37 38 1245 5 42 1175 51 27 32

Table 3: Performance evaluation of number of, number of count, and input using the tpcds-setup dataset.

Figure 2: Processing of the MapReduce file system using the DB scale clustering technique.

Figure 3: Processing of MapReduce for the generation of data based on cluster nodes.

4. EXPERIMENTAL ANALYSIS
The proposed algorithm is implemented in Hadoop. Hadoop is an open-source, Linux-based framework that provides the MapReduce function for the analysis of data. The MapReduce programming model is programmed in Java using the JDK compilation tools. To evaluate the proposed model, a count program is used for the analysis of the different datasets.

Table 1: Performance evaluation of number of, number of count, and input using the batch-tweet dataset.
count Batch- 1 5 35 count 15 47 28 31 1 48 24 26 15 5 36 455 11 551 355 135 6 37 39

Table 4: Performance evaluation of number of, number of count, and input using the zipcode-setup dataset.
count 1 51 38 85 27 325 1 6 32 39 1135 45 435 1245 47 37
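The evaluation above records a processing time and a count value for each run of the count program. A minimal sketch of how such a pair could be captured is given below, assuming a single-process Python stand-in for the Hadoop job; `count_job`, `timed_run` and the sample records are hypothetical, and the timings this sketch yields are not comparable to the values in Tables 1 to 4.

```python
import time
from collections import Counter
from typing import Callable, Dict, List, Tuple


def count_job(records: List[str]) -> Dict[str, int]:
    """Count how often each whitespace-separated token occurs."""
    return dict(Counter(token for record in records for token in record.split()))


def timed_run(job: Callable[[List[str]], Dict[str, int]],
              records: List[str]) -> Tuple[Dict[str, int], float]:
    """Run the job once and return its result with the elapsed seconds."""
    start = time.perf_counter()
    result = job(records)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Hypothetical sample records standing in for one input file.
sample = ["rt tweet hadoop", "tweet hadoop", "hadoop"]
counts, seconds = timed_run(count_job, sample)
print(f"count value: {sum(counts.values())}, "
      f"distinct keys: {len(counts)}, time: {seconds:.6f} s")
```

Repeating the run several times and reporting the median would reduce the effect of start-up noise on small inputs.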
Figure 4: Comparative performance evaluation graphs for the batch-tweet dataset.

Figure 5: Comparative performance evaluation graphs for the rt-tweet dataset.

Figure 6: Comparative performance evaluation graphs for the tpcds-setup dataset.

5. CONCLUSION & FUTURE SCOPE
This paper modified the MapReduce programming model using the DB scale clustering technique. The modified MapReduce model is simulated in the Hadoop framework. The processing of the MapReduce function is based on two data attributes: one is the cluster of data and the other is the key value. The key value drives the processing of a group of data for analysis. The modified MapReduce function component encapsulates the processing of the DB scale clustering technique, which defines the range of values of the grouped data for the processing of map-created blocks. More than 3% improvement has been observed in some applications, which is quite impressive from a computational perspective. It has also been observed that the clustering time becomes almost stationary for a higher number of nodes even when the input volume of data is increased from 7 million to 1 million. Thus, DB scale is a very useful clustering technique; using it in a cloud environment for processing big data has some inherent advantages, and it may be used for various applications.

The DB scale clustering technique used to modify the MapReduce programming model performs very well on limited data. However, when the data volume grows to petabytes, the grouping rules and policies suffer in the creation of data nodes. In future, the processing of petabyte-scale data will use a time-based optimization technique.

6.
REFERENCES
[1] Carson Kai-Sang Leung, Richard Kyle MacKinnon and Fan Jiang, "Reducing the Search Space for Big Data Mining for Interesting Patterns from Uncertain Data", IEEE, 2014, Pp 315-322.
[2] Rama Satish K. V. and Dr. N. P. Kavya, "Big Data Processing with harnessing Hadoop - MapReduce for Optimizing Analytical Workloads", IEEE, 2014, Pp 49-54.
[3] Seungwoo Jeon, Bonghee Hong and Byungsoo Kim, "Big Data Processing for Prediction of Traffic Time based on Vertical Data Arrangement", IEEE, 2014, Pp 327-333.
[4] Rajiv Ranjan, "Modeling and Simulation in Performance Optimization of Big Data Processing Frameworks", IEEE, 2014, Pp 14-19.
[5] Muhammad Mazhar Ullah Rathore, Anand Paul, Awais Ahmad, Bo-Wei Chen, Bormin Huang and Wen Ji, "Real-Time Big Data Analytical Architecture for Remote Sensing Application", IEEE, 2015, Pp 1-12.
[6] Jyoti V Gautam, Harshadkumar B Prajapati, Vipul K Dabhi and Sanjay Chaudhary, "A Survey on Job Scheduling Algorithms in Big Data Processing", IEEE, 2015, Pp 1-11.
[7] Alfred Daniel, Anand Paul and Awais Ahmad, "Near Real-Time Big Data Analysis on Vehicular Networks", International Conference on Soft-Computing and Network Security, 2015, Pp 1-7.
[8] Chun-Wei Tsai, Chin-Feng Lai, Ming-Chao Chiang and Laurence T. Yang, "Data Mining for Internet of Things: A Survey", IEEE, 2014, Pp 77-97.
[9] Albert Bifet, "Mining Big Data in Real Time", Informatica, 2013, Pp 15-20.
[10] Gwangbum Pyun, Unil Yun and Keun Ho Ryu, "Efficient frequent pattern mining based on Linear Prefix tree", Elsevier, 2013, Pp 125-139.
[11] Unil Yun and Keun Ho Ryu, "Approximate weighted frequent pattern mining with/without noisy environments", Elsevier, 2010, Pp 73-82.
[12] Zhi-Hua Zhou, Nitesh V. Chawla, Yaochu Jin and Graham J. Williams, "Big Data Opportunities and Challenges: Discussions from Data Analytics Perspectives", IEEE, 2011, Pp 1-2.
[13] Boris Novikov, Natalia Vassilieva and Anna Yarygina, "Querying Big Data", International Conference on Computer Systems and Technologies, 2012, Pp 1-1.
[14] Liwen Sun, Reynold Cheng, David W. Cheung and Jiefeng Cheng, "Mining Uncertain Data with Probabilistic Guarantees", ACM, 2010, Pp 273-282.
[15] Yuxuan Li, James Bailey, Lars Kulik and Jian Pei, "Mining Probabilistic Frequent Spatio-Temporal Sequential Patterns with Gap Constraints from Uncertain Databases", Pp 1-1.
[16] Carson Kai-Sang Leung and Fan Jiang, "Frequent Itemset Mining of Uncertain Data Streams Using the Damped Window Model", ACM, 2011, Pp 950-955.

IJCA TM : www.ijcaonline.org