FINE-GRAIN INCREMENTAL PROCESSING OF MAPREDUCE AND MINING IN BIG DATA ENVIRONMENT


S. Suresh Kumar, Jay Shriram Group of Institutions, Tirupur, sureshmecse25@gmail.com
Mr. A. M. Ravishankkar, M.E., Assistant Professor, Jay Shriram Group of Institutions, Tirupur, ravi662shankkar@gmail.com
Dr. S. Rajalakshmi, Ph.D., Associate Professor, Jay Shriram Group of Institutions, Tirupur, mrajislm@gmail.com

ABSTRACT

As new data and updates continuously arrive, the results of data mining applications become stale and obsolete over time. Incremental processing is a promising approach to refreshing mining results: it uses previously saved states to avoid the expense of re-computation from scratch. We propose i2MapReduce, a novel incremental processing extension to MapReduce, the most widely used framework for mining big data. Compared with the state-of-the-art work Incoop, i2MapReduce (i) performs key-value pair level incremental processing rather than task-level re-computation, (ii) supports not only one-step computation but also the more sophisticated iterative computation that is widely used in data mining applications, and (iii) incorporates a set of novel techniques to reduce the I/O overhead of accessing preserved fine-grain computation states. We evaluate i2MapReduce using a one-step algorithm and four iterative algorithms with diverse computation characteristics.

Keywords: Big Data, MapReduce, Incremental, Data Mining.

1. INTRODUCTION

The MapReduce framework offers convenient, distributed processing of data through a simple programming model that relieves the programmer of implementing complex logic or infrastructure for parallelization, data transfer, fault tolerance, and scheduling. An important property of the workloads processed by MapReduce applications is that they are often incremental by nature; that is, MapReduce jobs often run repeatedly with small changes in their input. For example, search engines periodically crawl the Web and perform various computations on this input, such as computing a Web index or the PageRank metric, often with only small modifications between runs. This growing nature of data suggests that performing large-scale computations incrementally can improve efficiency dramatically.

Broadly speaking, there are two ways to achieve such efficient incremental updates. The first is to devise systems that give the programmer facilities to save and reuse state across successive runs, so that only the computations affected by changes to the input need to be executed. This is precisely the strategy taken by major Internet companies, who developed systems such as Percolator and CBP. However, this approach requires adopting a new programming model and a new Application Programming Interface (API) that differs from the one used by MapReduce. These new APIs also require the programmer to devise a way to process updates efficiently, which can increase algorithmic and software complexity.
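For reference, the plain one-step MapReduce job that both approaches try to make incremental consists of just a user-defined Mapper and a Reducer. The word-count sketch below, written against the Hadoop Java API, is a generic illustration and is not code from i2MapReduce, Incoop, Percolator, or CBP.

// Illustrative one-step Hadoop job (word count), not code from the paper.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts collected for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}

When the input changes even slightly, a job like this recomputes every count from scratch; the incremental processing techniques discussed in this paper aim to avoid exactly that waste.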

Research in the algorithms community on dynamic algorithms shows that incremental algorithms can be very complex even for problems that are relatively straightforward in the non-incremental case.

The second approach is to develop systems that reuse the results of prior computations transparently. This shifts the complexity of incremental processing from the programmer to the processing system, essentially preserving the spirit of high-level models such as MapReduce. A few systems have taken this approach, e.g., DryadInc and Nectar in the context of the Dryad system, by providing task-level or LINQ expression-level memoization.

Fig. 1 Architecture diagram

The design of Incoop contains the following mechanisms, which were incorporated into the Hadoop MapReduce framework and proved instrumental in achieving efficient incremental computation.

Incremental HDFS. Instead of relying on HDFS to store the input to MapReduce jobs, Incoop uses a file system called Inc-HDFS that provides mechanisms to identify similarities in the input data of consecutive job runs. The idea is to divide the input into chunks whose boundaries depend on the file contents, so that small changes to the input do not shift all chunk boundaries. Inc-HDFS therefore partitions the input in a way that maximizes the opportunities for reusing results from previous computations, while preserving compatibility with HDFS by providing the same interface and semantics.

Contraction phase. Incoop introduces methods for controlling the granularity of tasks so that large tasks can be divided into smaller subtasks that can be reused even when the large tasks cannot. This is particularly challenging for Reduce tasks, whose granularity depends solely on their input. The solution is a new Contraction phase that leverages Combiner functions, normally used to reduce network traffic by performing a small part of the Reducer's processing early, to control the granularity of the Reduce tasks.

Memoization-aware scheduler. To increase the effectiveness of memoization, Incoop uses an affinity-based scheduler with a work-stealing algorithm to reduce the amount of data movement across machines. The scheduler strikes a balance between exploiting the locality of previously computed results and executing tasks on any available machine to prevent straggler effects.

Use cases. Incoop has been used to demonstrate two important use cases of incremental processing: incremental log processing, where a framework built on Incoop processes logs incrementally as new entries are added, and incremental query processing, where the Pig framework is layered on top of Incoop to enable relational query processing over continuously arriving data.
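The content-based chunking that Inc-HDFS relies on can be sketched with a simple rolling hash: a chunk boundary is declared wherever the hash of the last few bytes matches a fixed bit pattern, so a local edit moves only nearby boundaries. The Java sketch below is purely illustrative; the window size, mask, and hash constants are assumptions and do not reflect Incoop's actual implementation.

import java.util.ArrayList;
import java.util.List;

// Illustrative content-defined chunking: a boundary is declared whenever a rolling
// hash over the last WINDOW bytes matches a fixed bit pattern, so a small local
// edit shifts only the boundaries near the edit instead of all of them.
public class ContentDefinedChunker {
  private static final int WINDOW = 48;            // bytes of context per hash (assumed)
  private static final long MASK = (1 << 13) - 1;  // ~8 KB expected chunk size (assumed)
  private static final long PRIME = 31;

  public static List<Integer> chunkBoundaries(byte[] data) {
    List<Integer> boundaries = new ArrayList<>();
    long hash = 0;
    long power = 1;                                 // PRIME^(WINDOW-1), used for removals
    for (int i = 1; i < WINDOW; i++) power *= PRIME;

    for (int i = 0; i < data.length; i++) {
      if (i >= WINDOW) {
        hash -= power * (data[i - WINDOW] & 0xff);  // drop the byte leaving the window
      }
      hash = hash * PRIME + (data[i] & 0xff);       // add the new byte
      if (i >= WINDOW && (hash & MASK) == MASK) {
        boundaries.add(i + 1);                      // chunk boundary after position i
      }
    }
    boundaries.add(data.length);                    // final chunk ends at end of file
    return boundaries;
  }
}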

When designing and implementing an incremental computation framework, many factors, including algorithm accuracy, run time, and space overhead, need to be taken into consideration. This paper focuses on porting parallel algorithms based on the MapReduce model and on the compatibility of non-incremental and incremental processing. A parallel programming framework is presented that aims to be compatible with the original MapReduce APIs, so that programmers do not need to rewrite their algorithms. It makes the following contributions:

Incremental MapReduce framework. It supports incremental data input, incremental data processing, intermediate state preservation, and incremental map and reduce functions. The input manager can dynamically discover new inputs and then submit jobs to the master node.

Dynamic resource allocation based on state. The state provides an important reference for resource requests and allocation in the next execution. State information includes prior processing results, intermediate results, execution time, and the number of reduce tasks. The input data, acting as the observation, changes the current state into a new state after the job finishes.

Friendly APIs for applications. The method submitJob() in class JobClient is overloaded. Users only need to follow the method parameters to submit their jobs, without modifying their algorithms or application programs. Furthermore, for continuous inputs, users can obtain updated outputs in time.

2. RELATED WORKS

The PageRank algorithm computes ranking scores of web pages based on the web graph structure to support web search. The web graph is constantly evolving: web pages and hyperlinks are created, deleted, and updated. As the underlying web graph evolves, the PageRank results gradually become stale, potentially lowering the quality of web search, so it is desirable to refresh the PageRank computation regularly.

Incremental processing is a promising approach to refreshing mining results. Given the size of the input big data, it is often very expensive to rerun the entire computation from scratch. Incremental processing exploits the fact that the input data of two subsequent computations A and B are similar: only a very small fraction of the input data has changed. The idea is to save states in computation A, reuse A's states in computation B, and perform re-computation only for the states affected by the changed input data. A number of existing studies follow this principle and design new programming models to support incremental processing. However, these new programming models are drastically different from MapReduce, requiring programmers to completely re-implement their algorithms.

Incoop extends MapReduce to support incremental processing, but it has two main limitations. First, Incoop supports only task-level incremental processing. It saves and reuses states at the granularity of individual Map and Reduce tasks, each of which typically processes a large number of key-value pairs. If Incoop detects any data change in the input of a task, it reruns the entire task. While this approach easily leverages existing MapReduce features for state saving, it may incur a large amount of redundant computation if only a small fraction of kv-pairs have changed within a task. Second, Incoop supports only one-step computation, while important mining algorithms, such as PageRank, require iterative computation. Incoop would treat each iteration as a separate MapReduce job, and a small number of input data changes may gradually propagate to affect a large portion of intermediate states after a number of iterations, resulting in expensive global re-computation.
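The difference between task-level and kv-pair level reuse can be made concrete with a toy sketch: instead of caching one result per task, the framework keeps a saved result and an input fingerprint per reduce key, and recomputes only the keys whose grouped values changed. The class below is an illustrative in-memory model of this idea, not the MRBG-Store described later; the class name and the fingerprinting scheme are assumptions.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Illustrative kv-pair level memoization: previous reduce() outputs are cached per
// key together with a fingerprint of that key's input group, and only groups whose
// fingerprint changed are recomputed. A toy model of the idea, not the MRBG-Store.
public class KvPairMemo<V, R> {
  private final Map<String, Integer> fingerprints = new HashMap<>(); // key -> last input hash
  private final Map<String, R> results = new HashMap<>();            // key -> last reduce output

  public R reduceIfChanged(String key, List<V> group,
                           BiFunction<String, List<V>, R> reduceFn) {
    int fp = group.hashCode();                  // fingerprint of the grouped values (assumed)
    Integer previous = fingerprints.get(key);
    if (previous != null && previous == fp) {
      return results.get(key);                  // unchanged group: reuse the saved state
    }
    R fresh = reduceFn.apply(key, group);       // changed or new group: recompute
    fingerprints.put(key, fp);
    results.put(key, fresh);
    return fresh;
  }
}

With per-key state of this kind, changing one kv-pair in a task's input triggers one reduce() call rather than a rerun of the whole task.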

In Pregel, the computation is broken down into a sequence of supersteps. In each superstep, a Compute function is invoked on each vertex; it communicates with other vertices by sending and receiving messages and performs the computation for the current vertex. This method can efficiently support a large number of iterative graph algorithms. Other systems provide a groupwise processing operator, Translate, that takes state as an explicit input to support incremental analysis, but they adopt a new programming model that is very different from MapReduce. Several studies support incremental processing by task-level re-computation, but they require users to manipulate the states on their own; in contrast, i2MapReduce exploits fine-grain kv-pair level re-computation, which is more advantageous. For incremental processing of iterative applications, the timely dataflow paradigm allows stateful computation and arbitrary nested iterations, but to support incremental iterative computation, programmers have to completely rewrite their MapReduce programs. i2MapReduce instead extends the widely used MapReduce model for incremental iterative computation: previous MapReduce programs can be changed only slightly to run on i2MapReduce for incremental processing.

A. Limitations of MapReduce

The MapReduce programming model is popular due to its simplicity and high efficiency. In the MapReduce framework, when a job is submitted, the input is partitioned into fixed-size pieces called blocks, which reside on different data nodes. A job creates multiple splits according to the number of blocks, and each split is processed independently as the input of a separate map task. A map task on a worker node runs the user-defined map function for each record in its split and writes the output to the local disk. When there are multiple reducers, the map task partitions its output, producing one partition per reduce task. Reduce tasks write their outputs to the distributed file system. Despite its powerful automatic parallelization and strong fault tolerance, the original MapReduce model exhibits the following limitations.

Stateless. When map and reduce tasks finish, they write their outputs to a local disk or a distributed file system and then inform the scheduler. When a job completes, its intermediate outputs are deleted by a cleanup mechanism, and when new input arrives, a new job must be created to process it. This is similar to HTTP, a stateless protocol that provides no means of storing a user's data between requests; to some extent, the MapReduce model is likewise stateless.

Stage independent. A MapReduce job can be divided into two stages: the map stage and the reduce stage. Neither stage interrupts the other's execution. In the map stage, each computing thread executes the map method on the input split allocated to it and writes the output to the local node. In the reduce stage, each reduce thread fetches its input from designated nodes, executes the reduce method, and writes the output to the specified file system. All map tasks and reduce tasks execute their code without disturbing each other.

Single-step. Map tasks and reduce tasks execute only once, in order, for a given job. Map tasks may finish at different times, and reduce tasks start copying the map outputs as soon as all map tasks complete successfully.
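These limitations show up clearly when an iterative algorithm is forced onto plain MapReduce: each iteration becomes an independent, stateless, single-step job, driven by a loop like the sketch below. The driver is generic Hadoop code shown only for illustration; the identity Mapper and Reducer are placeholders for a real rank-contribution map and summing reduce, which are not shown.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver: because plain MapReduce is single-step and stateless, an
// iterative algorithm such as PageRank is typically run as a chain of independent
// jobs, each reading the previous iteration's output from HDFS and writing a new copy.
public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String input = args[0];
    int iterations = Integer.parseInt(args[1]);

    for (int i = 0; i < iterations; i++) {
      Job job = Job.getInstance(conf, "pagerank-iter-" + i);
      job.setJarByClass(IterativeDriver.class);
      job.setMapperClass(Mapper.class);            // identity placeholder for a rank-contribution map
      job.setReducerClass(Reducer.class);          // identity placeholder for a summing reduce
      job.setOutputKeyClass(LongWritable.class);   // matches the identity Mapper's output types
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, new Path(input));
      String output = "ranks-iter-" + (i + 1);
      FileOutputFormat.setOutputPath(job, new Path(output));
      if (!job.waitForCompletion(true)) {
        System.exit(1);                            // a failed iteration aborts the chain
      }
      input = output;                              // the next iteration reads this output
    }
  }
}

No state survives between iterations except the files written to HDFS, which is precisely what incremental iterative frameworks try to improve on.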
B. Extension of MapReduce

Because of the limitations listed above, the MapReduce model has to be extended for incremental computation.

Existing extensions feature different input patterns, separate or coupled control approaches between state and dataflow, and either compatible or newly added interfaces. Batch parallel processing refers to systems that provide highly efficient large-scale parallel computation over one batch input and produce one output, such as Google MapReduce, Hadoop, and Dryad/DryadLINQ. Incremental algorithms, which are more complicated than the original algorithms, improve the runtime by modifying the algorithms themselves; their input includes the newly added data and the latest running result. Continuous bulk processing is another kind of solution that supports incremental processing by providing new primitives with which developers design fine-grained, dataflow-oriented applications; it takes the intermediate results of prior executions as part of the explicit input. CBP from Yahoo and Percolator from Google both provide such incremental computation frameworks. Incremental computation based on MapReduce makes full use of the MapReduce programming model and supports incremental processing by modifying the kernel implementation of the map and reduce stages. Because HDFS does not currently support appends, some approaches also modify the distributed file system to support incremental data discovery and intermediate result storage.

IncMR is an improved method for large-scale incremental data processing. The framework inherits the simplicity of the original MapReduce model: it does not modify HDFS and still uses the same MapReduce APIs, so existing algorithms and programs can perform incremental data processing without any modification. Its design goals are:

Compatibility. The original MapReduce interfaces are retained to avoid incompatibility with existing applications. New interfaces are added by overloading methods. HDFS is used without modification.

Transparency. Users do not need to know how their incremental data are processed. All state data, including their storage paths, are hidden from users.

Reduced resource usage. Computing resource requests and allocation are determined by the historical state information and the size of the newly added data. Dynamic scheduling decisions help minimize the overhead.

Several important modules are added to the IncMR framework. The input manager finds newly added data automatically; according to the execution delay configuration or an input size threshold, it determines whether a new round of processing should be started. The job scheduler, unlike a traditional scheduler that determines the number of tasks to run only from a configuration file defined in advance, takes the state into consideration when choosing nodes for reduce tasks. A state manager stores all information needed for incremental jobs and provides decision support for the job scheduler. An output manager is responsible for the storage and update of all outputs.

C. Locality control and optimization

The job scheduler is responsible for creating map tasks and reduce tasks. The main overhead of IncMR lies in the storage and transmission of many intermediate results. In the shuffle phase, the framework fetches the relevant partition of the output of all the mappers to each reduce task node via HTTP. The shuffle and sort phases occur simultaneously: while map outputs are being fetched, they are merged. Locality control is commonly used to optimize job and task scheduling.

Naive locality control. When there is only one reduce task and it is always located on the same node, the reduce task can fetch cached state from the local node directly, without needing to fetch the prior map outputs from other nodes. Basically, the data flow is simplest to control when there is only one reduce task.
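Locality control targets the same cost that a Combiner attacks from the map side: the HTTP transfer of map outputs during the shuffle. The job setup below is a generic Hadoop sketch, reusing the word-count classes shown earlier; it registers a Combiner so that each map task pre-aggregates its output before it is fetched. This is the same hook that Incoop's Contraction phase builds on; the code is not IncMR or Incoop code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative job setup: registering a Combiner makes each map task pre-aggregate
// its own output before the shuffle, so far less data is fetched over HTTP by the
// reduce tasks. Reuses the WordCount classes sketched earlier.
public class CombinerJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count with combiner");
    job.setJarByClass(CombinerJob.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation per map task
    job.setReducerClass(WordCount.IntSumReducer.class);  // final aggregation after the shuffle
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The same reducer class can serve as the combiner here only because summation is associative and commutative; non-aggregative reduce functions need a separate combiner or none at all.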

Complex locality control. If the number of reduce tasks is greater than one and does not change, the job scheduler tries to allocate reduce tasks to nodes that have performed reduce tasks recently, because those nodes have cached the related state; moreover, the previously fetched results are already sorted. This locality control saves a great deal of data transfer time. If the number of reduce tasks or the placement of a reduce task changes, however, state fetching and data repartitioning cannot be avoided.

Iterative MapReduce. Iterative computation is ubiquitous in data mining, text processing, graph processing, and other data-intensive and compute-intensive applications. The MapReduce programming paradigm was designed initially for single-step execution, but its high efficiency and simplicity have attracted much interest in applying it to iterative settings and algorithms. HaLoop, a runtime based on Hadoop, supports iterative data analysis applications, especially over large-scale data; by caching the invariant data and the reducers' local outputs of a job, it can execute recursively. Twister is a lightweight MapReduce runtime that uses a publish/subscribe messaging infrastructure for communication and supports iterative task execution. To serve applications such as data mining and social network analysis over relational data, iMapReduce is designed to process iterative tasks automatically by reusing the prior task processors and eliminating the shuffle load of the static data. Iterative computing is also an indispensable part of incremental processing, and related iterative computing methods are applied in the IncMR framework as well.

Continuous MapReduce. Ad-hoc data processing is a critical paradigm for wide-scale data processing, especially for unstructured data. One line of work implements an ad-hoc data processing abstraction in a distributed stream processor based on the MapReduce programming model to support continuous inputs. MapReduce Online adopts pipelining within a job and between jobs, supports single-job and multi-job online aggregation, and also provides continuous database queries over data streams. Another approach combines the MapReduce programming model with a continuous query model characterized by Cut-Rewind to process dynamic stream data in chunks. CMR, continuous MapReduce, is an architecture for continuous and large-scale data analysis. Continuous processing is a special case of incremental processing: although the input is continuous, the processing is discrete, and the time interval or delay is an important factor to consider. IncMR supports continuous processing.

Applications based on MapReduce. When programming in the MapReduce framework, all one needs to do is prepare the input data and implement the mapper, the reducer, and, optionally, the combiner and the partitioner. Execution is handled transparently by the framework on clusters ranging from a single node to thousands of nodes, over data sets ranging from gigabytes to petabytes and data structures including relational tables, text, graphs, video, and audio. Through a simple interface with two functions, map and reduce, the MapReduce model facilitates parallel implementation of many real-world tasks, such as data processing for search engines and machine learning, and it is now also popular in scientific computing. When new input arrives, however, it is a challenge for applications to deal with it continuously using the original MapReduce; this paper addresses the problem in general.
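Of the four pieces just listed, the partitioner decides which reduce task receives each map-output key. The sketch below is equivalent to Hadoop's default hash partitioning and is shown only for illustration; a deterministic choice like this also helps locality-based schemes, because the same key keeps landing on the same reducer across runs.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative custom Partitioner: maps each map-output key to a reduce task.
// Generic example, not code from the paper.
public class HashModPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask off the sign bit so the result is always a valid partition index.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

It would be registered on a job with job.setPartitionerClass(HashModPartitioner.class).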

3. OUR WORK

We propose i2MapReduce, an extension to MapReduce that supports fine-grain incremental processing for both one-step and iterative computation. Compared to previous solutions, i2MapReduce incorporates the following three novel features.

Fine-grain incremental processing using MRBG-Store. Unlike Incoop, i2MapReduce supports kv-pair level fine-grain incremental processing in order to minimize the amount of re-computation as much as possible. We model the kv-pair level data flow and data dependence in a MapReduce computation as a bipartite graph, called the MRBGraph. An MRBG-Store is designed to preserve the fine-grain states in the MRBGraph and to support efficient queries that retrieve fine-grain states for incremental processing.

General-purpose iterative computation. With a modest extension of the MapReduce API, previous work proposed iMapReduce to efficiently support iterative computation on the MapReduce platform. It targets types of iterative computation with a one-to-one/all-to-one correspondence from Reduce output to Map input. In comparison, our method provides general-purpose support, covering not only one-to-one but also one-to-many, many-to-one, and many-to-many correspondences. We enhance the Map API to allow users to easily express loop-invariant structure data and propose a Project API function to express the correspondence from Reduce to Map. While users need to change their algorithms slightly to take full advantage of i2MapReduce, such modification is modest compared to the effort of re-implementing algorithms in a completely different programming paradigm.

Incremental processing for iterative computation. Incremental iterative processing is substantially more challenging than incremental one-step processing, because even a small number of updates may propagate to affect a large portion of intermediate states after a number of iterations. To address this problem, we propose to reuse the converged state from the previous computation and to employ a change propagation control (CPC) mechanism. We also enhance the MRBG-Store to better support the access patterns of incremental iterative processing. i2MapReduce is the first MapReduce-based solution that efficiently supports incremental iterative computation.

TECHNIQUES

i2MapReduce query algorithm in the MRBG-Store
PageRank in MapReduce
K-means in MapReduce
GIM-V (Generalized Iterated Matrix-Vector multiplication) in MapReduce

CONCLUSIONS

We have described i2MapReduce, a MapReduce-based framework for incremental big data processing. i2MapReduce combines a fine-grain incremental engine, a general-purpose iterative model, and a set of effective techniques for incremental iterative computation. Real-machine experiments show that i2MapReduce can significantly reduce the run time for refreshing big data mining results compared to re-computation on both plain and iterative MapReduce. In future work, we will evaluate the computation using a one-step algorithm and four iterative algorithms with diverse computation characteristics. Experimental results show significant performance improvements of i2MapReduce over MapReduce performing re-computation.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proc. 6th Conf. Symp. Oper. Syst. Des. Implementation, 2004, p. 10.
[2] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proc. 9th USENIX Conf. Netw. Syst. Des. Implementation, 2012, p. 2.
[3] R. Power and J. Li, "Piccolo: Building fast, distributed programs with partitioned tables," in Proc. 9th USENIX Conf. Oper. Syst. Des. Implementation, 2010, pp. 1-14.
[4] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, "Pregel: A system for large-scale graph processing," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2010, pp. 135-146.
[5] S. R. Mihaylov, Z. G. Ives, and S. Guha, "REX: Recursive, delta-based data-centric computation," in Proc. VLDB Endowment, 2012, vol. 5, no. 11, pp. 1280-1291.
[6] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, "Distributed GraphLab: A framework for machine learning and data mining in the cloud," in Proc. VLDB Endowment, 2012, vol. 5, no. 8, pp. 716-727.
[7] S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl, "Spinning fast iterative data flows," in Proc. VLDB Endowment, 2012, vol. 5, no. 11, pp. 1268-1279.
[8] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, "HaLoop: Efficient iterative data processing on large clusters," in Proc. VLDB Endowment, 2010, vol. 3, no. 1-2, pp. 285-296.
[9] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox, "Twister: A runtime for iterative MapReduce," in Proc. 19th ACM Symp. High Performance Distributed Comput., 2010, pp. 810-818.
[10] Y. Zhang, Q. Gao, L. Gao, and C. Wang, "iMapReduce: A distributed computing framework for iterative computation," J. Grid Comput., vol. 10, no. 1, pp. 47-68, 2012.