i2MapReduce: Incremental MapReduce for Mining the Big Data


IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 18, Issue 3, Ver. IV (May-Jun. 2016), PP 125-129
www.iosrjournals.org

i2MapReduce: Incremental MapReduce for Mining the Big Data

Prof. Kanchan Dhomse B. (IT, METS BKC IOE, Nashik, India)

Abstract: Data mining is an interdisciplinary area of computer engineering. Incremental processing is a promising approach to refreshing mining results: it reuses saved states to avoid the cost of re-computation from scratch. Here, I propose i2MapReduce, a novel incremental processing extension to MapReduce for data mining. Compared with the state-of-the-art work Incoop, i2MapReduce performs key-value pair level incremental processing instead of task-level re-computation. It supports not only one-step computation but also more sophisticated iterative computation, and it incorporates a set of novel techniques to reduce the I/O cost of accessing preserved fine-grain computation states. Results on Amazon EC2 show significant performance improvements of i2MapReduce compared to both plain and iterative MapReduce performing re-computation.

Keywords: Big data, Incremental processing, MapReduce, MRBGraph, Re-computation.

I. Introduction

Recently, huge amounts of digital data have been accumulating in many important areas, such as social networks, e-commerce, education, finance, health care, and the environment. It has become increasingly popular to mine such big data in order to obtain insights that support business decisions or that provide better personalized, higher-quality services. A large number of computing frameworks [1-10] have been developed for big data analysis. Among these frameworks, MapReduce [1], with its open-source implementations such as Hadoop, is the most widely used in production because of its simplicity, generality, and maturity. Here, I focus on improving MapReduce performance.

Big data is constantly evolving: as new updates are collected, the input data of a big data mining algorithm gradually changes, so it is desirable to periodically refresh the mining computation in order to keep the mining results accurate. For example, the PageRank algorithm [11] evaluates ranking scores of web pages based on the web graph structure to support web search. Here, I explain the nature of the problem and previous work, the purpose, and the contribution of this paper, which presents the structure of i2MapReduce. Incremental processing exploits the fact that the input data of two subsequent computations are similar: only a very small fraction of the input data has changed. This paper realizes that principle in the context of the MapReduce computing framework.

Fig. 1: Computation of MapReduce

II. Implementation

1. MapReduce Background

A MapReduce program is composed of a Map function and a Reduce function [1], as shown in Fig. 1. Their APIs are as follows:

Map(<K1, V1>) -> list(<K2, V2>)
Reduce(<K2, list(V2)>) -> list(<K3, V3>)

The Map function takes a kv-pair <K1, V1> as input and produces zero or more intermediate kv-pairs <K2, V2>. All <K2, V2> pairs are then grouped by K2, and the Reduce function takes a K2 and a list of V2 values as input and computes the final output kv-pairs <K3, V3>.
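As a concrete illustration of these APIs, the following is a minimal word-count sketch in Python. It is my illustration rather than code from the paper, and the names map_fn, reduce_fn, and run_mapreduce are hypothetical; the runner mimics the shuffle step by grouping intermediate pairs by K2 before calling Reduce.

```python
from collections import defaultdict

# Map(<K1, V1>) -> list(<K2, V2>): emit one <word, 1> pair per word in a line.
def map_fn(k1, v1):
    for word in v1.split():
        yield (word, 1)

# Reduce(<K2, list(V2)>) -> list(<K3, V3>): sum the counts of each word.
def reduce_fn(k2, v2_list):
    yield (k2, sum(v2_list))

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)  # shuffle: group intermediate pairs by K2
    return [kv for k2, v2s in groups.items() for kv in reduce_fn(k2, v2s)]

if __name__ == "__main__":
    lines = [(0, "big data mining"), (1, "incremental big data")]
    print(run_mapreduce(lines, map_fn, reduce_fn))
    # -> [('big', 2), ('data', 2), ('mining', 1), ('incremental', 1)]
```

In a real Hadoop job, the grouping is performed by the distributed shuffle phase described next, not by an in-memory dictionary.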

A MapReduce system normally reads the input data from, and writes the final results to, a distributed file system, which splits a file into equal-sized blocks (e.g., 64 MB) and stores the blocks across a cluster of machines. The system runs a JobTracker process on a master node to coordinate the job progress, and a set of TaskTracker processes on worker nodes to execute the actual Map and Reduce tasks. The JobTracker starts one Map task per data block and typically assigns it to the TaskTracker on the machine that holds the corresponding block, in order to reduce communication. Each Map task invokes the map function for every input <K1, V1> pair and stores the intermediate kv-pairs <K2, V2> on local disk. The intermediate results are shuffled to Reduce tasks according to a partition function on K2. After a Reduce task has fetched and merged the intermediate results from all Map tasks, it invokes the reduce function on every <K2, list(V2)> group to generate the final output kv-pairs <K3, V3>.

2. Fine-Grain Incremental Processing for One-Step Computation

We start by describing the basic idea of fine-grain incremental processing in MapReduce. The main design includes the MRBGraph abstraction and the incremental processing engine. The design has two aspects: the mechanism that maintains the fine-grain states, and the handling of a special case in which the Reduce function performs an aggregation.

2.1 MRBGraph Abstraction

We use an MRBGraph (Map Reduce Bipartite Graph) abstraction to model the data flow in MapReduce. As shown in Fig. 2(a), every vertex on the Map side represents an individual Map function call instance on a <K1, V1> pair, and every vertex on the Reduce side represents an individual Reduce function call instance on a group of <K2, V2> pairs. An edge from a Map instance to a Reduce instance means that the Map instance produces a <K2, V2> pair that is shuffled to become part of the input of that Reduce instance; for example, the input of one Reduce instance may come from Map instances 0, 2, and 4. i2MapReduce therefore preserves a triple (K2, MK, V2) for each MRBGraph edge, where MK identifies the Map instance that produced the pair.

Fig. 2: MRBGraph

2.2 Fine-Grain Incremental Processing Engine

The fine-grain incremental processing engine can be illustrated with an example application that computes the sum of in-edge weights for each vertex in a graph. The input data, i.e., the graph structure, evolves over time while the job is re-run; the engine performs incremental processing to update the analysis results, as sketched below.
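The following minimal Python sketch conveys the key-value pair level idea for this running example. It is my illustration under simplifying assumptions, not the paper's engine, and names such as edge_state and incremental_update are hypothetical: one (K2, MK, V2) entry is preserved per MRBGraph edge, and on a change to the input, Map is re-run only on the changed kv-pairs and only the affected K2 groups are re-reduced.

```python
from collections import defaultdict

# Preserved MRBGraph edge states: K2 -> {MK: V2}, one entry per edge from
# Map instance MK to Reduce instance K2.
edge_state = defaultdict(dict)

def map_fn(src, out_edges):
    # Running example: for vertex `src` with weighted out-edges
    # [(dst, w), ...], emit one <dst, w> pair per edge.
    for dst, w in out_edges:
        yield (dst, w)

def reduce_fn(dst, weights):
    # Running example: sum of in-edge weights of vertex `dst`.
    return (dst, sum(weights))

def incremental_update(delta):
    """delta: {MK: new_input_or_None} -- changed or deleted input kv-pairs.
    Returns fresh results only for the affected Reduce instances."""
    dirty = set()
    for mk, new_input in delta.items():
        for k2 in list(edge_state):            # retract MK's old contributions
            if edge_state[k2].pop(mk, None) is not None:
                dirty.add(k2)
        if new_input is not None:              # re-run Map on the delta only
            for k2, v2 in map_fn(mk, new_input):
                edge_state[k2][mk] = v2
                dirty.add(k2)
    # Re-reduce only the K2 groups whose incoming edge set changed.
    return [reduce_fn(k2, list(edge_state[k2].values())) for k2 in dirty]

if __name__ == "__main__":
    # Initial run: every input kv-pair counts as "changed".
    print(incremental_update({'a': [('b', 1.0)], 'b': [('a', 2.0), ('c', 3.0)]}))
    # Incremental refresh: only vertex 'a' changed its out-edges, so only
    # the affected destinations are re-reduced.
    print(incremental_update({'a': [('c', 5.0)]}))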

III. Algorithms

In data mining and analysis, the fundamental algorithms form the field of data science. They comprise standard methods for analyzing patterns and building models for all kinds of data, with applications ranging from scientific discovery to business intelligence and analytics, and they guide algorithm selection [8].

3. General-Purpose Support for Iterative Computation

3.1 Analyzing Iterative Computation

PageRank is an iterative graph algorithm for ranking web pages. It evaluates a ranking score for each vertex in a graph.

Algorithm 1: PageRank in MapReduce

Kmeans is a widely used clustering algorithm that partitions points into k clusters.

Algorithm 2: Kmeans in MapReduce

Fig. 3: Incremental processing for an application

4. Incremental Iterative Processing

Here we present the incremental processing techniques for iterative computation. It is not sufficient to simply combine the above solutions for incremental one-step processing with iterative computation; we discuss three main aspects that we addressed in order to obtain an effective design of i2MapReduce.

Fig. 6: Iterative model of i2MapReduce
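To connect Algorithm 1 with the iterative model of Fig. 6, here is a minimal self-contained PageRank sketch in Python. It is my illustration rather than the paper's algorithm listing, and names such as pagerank_map are hypothetical: each iteration is one map/reduce pass, the graph structure is passed through the shuffle alongside the rank state, and the loop stops once the ranks converge.

```python
from collections import defaultdict

def pagerank_map(vertex, value):
    rank, out_links = value
    yield (vertex, ('links', out_links))          # pass structure data through
    for dst in out_links:                         # split rank over out-edges
        yield (dst, ('contrib', rank / len(out_links)))

def pagerank_reduce(vertex, values, damping=0.85):
    out_links, total = [], 0.0
    for tag, v in values:
        if tag == 'links':
            out_links = v
        else:
            total += v
    return (vertex, ((1 - damping) + damping * total, out_links))

def pagerank(graph, tol=1e-4, max_iters=50):
    # graph: {vertex: [out_vertices]}; every vertex starts with rank 1.0.
    state = {v: (1.0, links) for v, links in graph.items()}
    for _ in range(max_iters):
        groups = defaultdict(list)                # shuffle: group by dst vertex
        for v, val in state.items():
            for k2, v2 in pagerank_map(v, val):
                groups[k2].append(v2)
        new_state = dict(pagerank_reduce(k2, v2s) for k2, v2s in groups.items())
        delta = max(abs(new_state[v][0] - state[v][0]) for v in state)
        state = new_state
        if delta < tol:                           # ranks stable: converged
            break
    return {v: rank for v, (rank, _) in state.items()}

if __name__ == "__main__":
    print(pagerank({'a': ['b'], 'b': ['a', 'c'], 'c': ['a']}))
```

In i2MapReduce, the MRBGraph states of such a loop are additionally preserved across runs, so that a small change to the input graph triggers fine-grain incremental updates instead of a full re-computation of every iteration.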

IV. Result Analysis

The algorithms described above serve as the workloads for evaluating the i2MapReduce solution. Our experiments compare four solutions: PlainMR recomp, i.e., re-computation on vanilla Hadoop; iterMR recomp, i.e., re-computation on Hadoop optimized for iterative computation; re-computation on the iterative MapReduce framework, which improves MapReduce by providing a structure-data caching mechanism; and i2MapReduce, our proposed solution. Incoop, to the best of our knowledge the state-of-the-art task-level coarse-grain incremental processing system, is not publicly available, so we cannot compare i2MapReduce with Incoop directly. The statistics show, however, that without careful data partitioning almost all tasks see changed input in the experiments, making task-level incremental processing much less effective than fine-grain incremental processing.

Fig. 7: Run time of individual stages in PageRank

Fig. 8: Normalized runtime

V. Conclusion

This paper described the i2MapReduce technique, a MapReduce-based framework for incremental big data processing. i2MapReduce combines a fine-grain incremental engine, a general-purpose iterative model, and a set of effective techniques for incremental iterative computation. Real-machine experiments show that i2MapReduce can significantly reduce the run time for refreshing big data mining results compared with re-computation on both plain and iterative MapReduce.

Acknowledgements

We especially thank our department and our institute for their great support of this paper. I am truly thankful to all the staff and to our H.O.D., Prof. N. R. Kale, who showed me the way on the successful journey of publishing this paper.

References

Journal Papers
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proc. 6th Conf. Symp. Oper. Syst. Des. Implementation, 2004, p. 10.
[2] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proc. 9th USENIX Conf. Netw. Syst. Des. Implementation, 2012, p. 2.
[3] R. Power and J. Li, "Piccolo: Building fast, distributed programs with partitioned tables," in Proc. 9th USENIX Conf. Oper. Syst. Des. Implementation, 2010, pp. 1-14.
[4] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, "Pregel: A system for large-scale graph processing," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2010, pp. 135-146.
[5] S. R. Mihaylov, Z. G. Ives, and S. Guha, "REX: Recursive, delta-based data-centric computation," Proc. VLDB Endowment, vol. 5, no. 11, pp. 1280-1291, 2012.
[6] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, "Distributed GraphLab: A framework for machine learning and data mining in the cloud," Proc. VLDB Endowment, vol. 5, no. 8, pp. 716-727, 2012.

Proceedings Papers
[7] S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl, "Spinning fast iterative data flows," Proc. VLDB Endowment, vol. 5, no. 11, pp. 1268-1279, 2012.

Books
[8] M. J. Zaki and W. Meira Jr., Data Mining and Analysis (Cambridge University Press, 2014).