FINE-GRAIN INCREMENTAL PROCESSING OF MAPREDUCE AND MINING IN BIG DATA ENVIRONMENT

S. SURESH KUMAR, Jay Shriram Group of Institutions, Tirupur, sureshmecse25@gmail.com
Mr. A. M. RAVISHANKKAR, M.E., Assistant Professor, Jay Shriram Group of Institutions, Tirupur, ravi662shankkar@gmail.com
Dr. S. RAJALAKSHMI, Ph.D., Associate Professor, Jay Shriram Group of Institutions, Tirupur, mrajislm@gmail.com

ABSTRACT
As new data and updates continuously arrive, the results of data mining applications become stale and obsolete over time. Incremental processing is a promising approach to refreshing mining results: it utilizes previously saved states to avoid the expense of re-computation from scratch. We propose i2MapReduce, a novel incremental processing extension to MapReduce, the most widely used framework for mining big data. Compared with the state-of-the-art work Incoop, i2MapReduce performs key-value pair level incremental processing rather than task-level re-computation, supports not only one-step computation but also the more sophisticated iterative computation widely used in data mining applications, and incorporates a set of novel techniques to reduce the I/O overhead of accessing preserved fine-grain computation states. We evaluate i2MapReduce using a one-step algorithm and four iterative algorithms with diverse computation characteristics.

Keywords: Big Data, MapReduce, Incremental Processing, Data Mining.

1. INTRODUCTION
The MapReduce framework offers convenient, distributed processing of data through a simple programming model that relieves the programmer of implementing complex logic or infrastructure for parallelization, data transfer, fault tolerance, and scheduling. An important property of the workloads processed by MapReduce applications is that they are often incremental by nature; i.e., MapReduce jobs often run repeatedly with small changes in their input. For example, search engines periodically crawl the Web and perform various computations on this input, such as computing a Web index or the PageRank metric, often with very small modifications between runs. This growing nature of data suggests that performing large-scale computations incrementally can improve efficiency dramatically.

Broadly speaking, there are two methods to achieve such efficient incremental updates. The first is to devise systems that provide the programmer with facilities to save and use state across successive runs, so that only the computations affected by changes to the input need to be re-executed. This is precisely the strategy taken by the major Internet companies that developed systems such as Percolator and CBP. However, this method requires adopting a new programming model and a new Application Programming Interface (API) that differs from the one used by MapReduce. These new APIs also require the programmer to devise a way to process updates efficiently, which can increase algorithmic and software complexity.
Research in the algorithms community on dynamic algorithms shows that such algorithms can be very complex even for problems that are relatively straightforward in the non-incremental case. The second method is to develop systems that transparently reuse the results of prior computations. This method shifts the complexity of incremental processing from the programmer to the processing system, essentially keeping the spirit of high-level models such as MapReduce. A few systems have taken this approach, e.g., DryadInc and Nectar, in the context of the Dryad system, by providing techniques for task-level or LINQ expression-level memoization.

Fig. 1 Architecture diagram

The design of Incoop contains the following new methods, which we incorporated into the Hadoop MapReduce framework and which proved instrumental in achieving efficient incremental computations.

Incremental HDFS. Instead of relying on HDFS to store the input to MapReduce jobs, we devise a file system called Inc-HDFS that provides mechanisms to identify similarities in the input data of consecutive job runs. The idea is to divide the input into chunks whose boundaries depend on the file contents, so that small changes to the input do not change all chunk boundaries (a content-defined chunking sketch follows after these design points). Inc-HDFS therefore partitions the input in a way that maximizes the opportunities for reusing results from previous computations, while preserving compatibility with HDFS by providing the same interface and semantics.

Contraction phase. We introduce methods for controlling the granularity of tasks so that large tasks can be divided into smaller subtasks that can be reused even when the large tasks cannot. This is particularly challenging for Reduce tasks, whose granularity depends solely on their input. Our solution is a new Contraction phase that leverages Combiner functions, normally used to reduce network traffic by anticipating a small part of the processing done by Reduce tasks, to control the granularity of the Reduce tasks.

Memoization-aware scheduler. To increase the effectiveness of memoization, we propose an affinity-based scheduler that uses a work-stealing algorithm to reduce the amount of data movement across machines. The new scheduler strikes a balance between exploiting the locality of previously computed results and executing tasks on any available machine to prevent straggling effects.

Use cases. We employ Incoop to demonstrate two important use cases of incremental processing: incremental log processing, where we use Incoop to build a framework that incrementally processes logs as more entries are added to them; and incremental query processing, where we layer the Pig framework on top of Incoop to enable relational query processing over continuously arriving data.
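To make the content-based chunking idea concrete, the following is a minimal Java sketch of a content-defined chunker. It is not Incoop's actual Inc-HDFS code: the window size, the expected chunk size, and the rolling-hash constants are illustrative assumptions. The key property it demonstrates is that cut points are derived from local content, so an edit early in a file shifts only the chunk boundaries near it rather than all of them.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Minimal content-defined chunker (illustrative, not Incoop's code).
 *  A chunk boundary is declared whenever a rolling hash of the last
 *  WINDOW bytes hits a fixed bit pattern, so boundaries move with content. */
public class ContentDefinedChunker {
    private static final int WINDOW = 48;          // bytes covered by the rolling hash
    private static final int AVG_CHUNK = 4096;     // expected average chunk size
    private static final long MASK = AVG_CHUNK - 1;
    private static final long POW = pow31(WINDOW); // 31^WINDOW, to evict old bytes

    /** Returns the end offsets of all chunks in the stream. */
    public static List<Long> chunkBoundaries(InputStream in) throws IOException {
        List<Long> boundaries = new ArrayList<>();
        byte[] window = new byte[WINDOW];
        long hash = 0;
        long pos = 0;
        int b;
        while ((b = in.read()) != -1) {
            int oldest = window[(int) (pos % WINDOW)] & 0xFF; // byte leaving the window
            window[(int) (pos % WINDOW)] = (byte) b;
            hash = hash * 31 + (b & 0xFF) - oldest * POW;     // Rabin-Karp style update
            pos++;
            if ((hash & MASK) == MASK) {                      // content-determined cut point
                boundaries.add(pos);
            }
        }
        if (boundaries.isEmpty() || boundaries.get(boundaries.size() - 1) != pos) {
            boundaries.add(pos);                              // final chunk ends at EOF
        }
        return boundaries;
    }

    private static long pow31(int n) {
        long p = 1;
        for (int i = 0; i < n; i++) p *= 31;
        return p;
    }
}
```

With fixed-size splits, inserting one byte at the front of a file shifts every split; with the content-defined rule above, all chunks after the edited region keep their old boundaries and can be matched against the previous run's memoized results.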
When we design and implement an incremental computation framework, many factors, including algorithm accuracy, running time, and space overhead, need to be taken into consideration. This paper focuses on the transplantation of parallel algorithms based on the MapReduce model and on the compatibility of non-incremental and incremental processing. A parallel programming framework is presented that aims to be compatible with the original MapReduce APIs, so that programmers do not need to rewrite their algorithms. It makes the following contributions:

Incremental MapReduce framework. It supports incremental data input, incremental data processing, intermediate state preservation, and incremental map and reduce functions. The input manager can dynamically discover new inputs and then submit jobs to the master node.

Dynamic resource allocation based on state. The state provides an important reference for resource request and allocation in the next execution. State information includes prior processing results, intermediate results, execution time, and the number of reduce tasks. Input data, acting as the observation, changes the current state into a new state after the job finishes.

Friendly APIs for applications. The method submitJob() in class JobClient is overloaded; a hypothetical sketch of such an overload is given below. Users only need to follow the method parameters to submit their jobs, without modifying their algorithms or application programs. Furthermore, for continuous inputs, users can get updated outputs in time.
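To illustrate the "friendly APIs" idea, here is a hypothetical sketch of such an overload built on the classic Hadoop JobClient API. The IncJobClient class, the incmr.* configuration keys, and the parameter layout are our illustrative assumptions, not the framework's published code; only JobClient.submitJob(JobConf) is the real Hadoop method being overloaded.

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

/** Hypothetical overload in the spirit of the framework's friendly APIs:
 *  the classic submitJob(JobConf) still works unchanged, while an extra
 *  variant lets the caller point the job at the newly added input and at
 *  the state saved by the previous run. */
public class IncJobClient extends JobClient {

    public IncJobClient(JobConf conf) throws IOException {
        super(conf);
    }

    /** Submit an incremental job: process only deltaInput, reusing savedState. */
    public RunningJob submitJob(JobConf conf, Path deltaInput, Path savedState)
            throws IOException {
        // Illustrative configuration keys; the real framework keeps state
        // locations transparent to the user.
        conf.set("incmr.delta.input", deltaInput.toString());
        conf.set("incmr.state.path", savedState.toString());
        conf.setBoolean("incmr.incremental", true);
        return super.submitJob(conf);   // unchanged original submission path
    }
}
```

Because the overload only adds parameters on top of the original signature, existing programs that call submitJob(conf) keep working, which is the compatibility property the contribution list emphasizes.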
2. RELATED WORKS
The PageRank algorithm computes ranking scores of web pages based on the web graph structure to support web search. The web graph structure is constantly evolving: web pages and hyperlinks are created, deleted, and updated. As the underlying web graph evolves, the PageRank results gradually become stale, potentially lowering the quality of web search, so it is desirable to refresh the PageRank computation regularly.

Incremental processing is a promising approach to refreshing mining results. Given the size of the input big data, it is often very expensive to rerun the entire computation from scratch. Incremental processing exploits the fact that the input data of two subsequent computations A and B are similar: only a very small fraction of the input data has changed. The idea is to save states in computation A, reuse A's states in computation B, and perform re-computation only for the states that are affected by the changed input data. A number of existing studies have followed this principle and designed new programming models to support incremental processing; however, these models are drastically different from MapReduce, requiring programmers to completely re-implement their algorithms.

Incoop extends MapReduce to support incremental processing, but it has two main limitations. First, Incoop supports only task-level incremental processing: it saves and reuses states at the granularity of individual Map and Reduce tasks. Each task typically processes a large number of key-value pairs. If Incoop detects any data change in the input of a task, it reruns the entire task. While this approach easily leverages existing MapReduce features for state saving, it may incur a large amount of redundant computation if only a small fraction of the kv-pairs in a task have changed. Second, Incoop supports only one-step computation, while important mining algorithms, such as PageRank, require iterative computation. Incoop would treat each iteration as a separate MapReduce job, and a small number of input data changes may gradually propagate to affect a large portion of intermediate states after a number of iterations, resulting in expensive global re-computation afterwards.
Pregel breaks the computation down into a sequence of supersteps. In each superstep, a Compute function is invoked on each vertex; it communicates with other vertices by sending and receiving messages and performs computation for the current vertex. This method can efficiently support a large number of iterative graph algorithms. CBP provides a groupwise processing operator, Translate, that takes state as an explicit input to support incremental analysis, but it adopts a new programming model that is very different from MapReduce. Several research studies support incremental processing by task-level re-computation, but they require users to manipulate the states on their own. In contrast, i2MapReduce exploits fine-grain kv-pair level re-computation, which is more advantageous. For incremental processing of iterative applications, the timely dataflow paradigm allows stateful computation and arbitrary nested iterations; however, to support incremental iterative computation, programmers have to completely rewrite their MapReduce programs. We instead extend the widely used MapReduce model for incremental iterative computation, so that previous MapReduce programs can be slightly changed to run on i2MapReduce for incremental processing.

A. Limitations of MapReduce
The MapReduce programming model is popular due to its simplicity and high efficiency. In the MapReduce framework, when a job is submitted, the related input is partitioned into fixed-size pieces called blocks, which are located on different data nodes. A job creates multiple splits according to the number of blocks, and each split is processed independently as the input of a separate map task. A map task on a worker node runs the user-defined map function for each record in the split and writes its output to the local disk. When there are multiple reducers, each map task partitions its output into one partition per reduce task. Reduce tasks write their outputs to the distributed file system. Despite its powerful automatic parallelization and strong fault tolerance, the original MapReduce model exhibits the following limitations; a minimal job skeleton illustrating this one-step, stateless flow is sketched after the list.

Stateless. When map and reduce tasks finish, they write their outputs to a local disk or a distributed file system and then inform the scheduler. When a job completes, the related intermediate outputs are deleted by a cleanup mechanism. When new data or input arrives, a new job needs to be created to process it. This is just like HTTP, a stateless protocol, which provides no means of storing a user's data between requests. To some extent, we can say that the MapReduce model is also stateless.

Stage independent. A MapReduce job can be divided into two stages: the map stage and the reduce stage. Neither stage interrupts the other's execution. In the map stage, each computing thread executes the map method on the input split allocated to it and writes the output to the local node. In the reduce stage, each reduce thread fetches input from the designated nodes, executes the reduce method, and writes the output to the specified file system. All map tasks and reduce tasks execute their code without disturbing each other.

Single-step. Map tasks and reduce tasks execute only once, in order, for a given job. Map tasks may finish at different times, and reduce tasks start copying map outputs as soon as the map tasks complete successfully.
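As a concrete reference point for these limitations, the following is a standard, minimal Hadoop word-count job written against the stock org.apache.hadoop.mapreduce API. Nothing here is framework-specific; it simply shows the one-step, stateless flow: map output is spilled to local disk, shuffled once, reduced once, and then all intermediate state is discarded.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Standard one-step MapReduce job: after the reducers write to HDFS,
 *  every intermediate map output is deleted -- the "stateless" limitation. */
public class WordCount {

    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer tok = new StringTokenizer(line.toString());
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                ctx.write(word, ONE);  // spilled to local disk, one partition per reducer
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(key, new IntWritable(sum));  // final output goes to HDFS
        }
    }
}
```

If one new document arrives, plain MapReduce must rerun this entire job over the full input, even though only a handful of word counts actually change; that is exactly the waste incremental processing targets.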
B. Extension of MapReduce
Because of the limitations listed above, the MapReduce model has to be extended for incremental computation. Existing extensions feature different input patterns, separate or coupled control approaches between state and dataflow, and compatible or newly added interfaces.
Batch parallel processing refers to systems that provide highly efficient large-scale parallel computation over one batch input, producing one output, such as Google MapReduce, Hadoop, and Dryad/DryadLINQ. Incremental algorithms, though more complicated than the original algorithms, can improve the runtime; their input includes the newly added data and the latest running result. Continuous bulk processing is another kind of solution that supports incremental processing by providing new primitives with which developers can design delicate dataflow-oriented applications; it takes the secondary results of prior executions as part of the explicit input. CBP from Yahoo and Percolator from Google both provide such incremental computation frameworks. Incremental computation based on MapReduce, making full use of the MapReduce programming model, supports incremental processing by modifying the kernel implementation of the map and reduce stages. Because HDFS does not currently support appends, some approaches also modify the distributed file system to support incremental data discovery and intermediate result storage.

IncMR is an improved method for large-scale incremental data processing. The framework inherits the simplicity of the original MapReduce model: it does not modify HDFS and still uses the same MapReduce APIs, so all algorithms and programs can perform incremental data processing without any modification.

Compatibility. The original MapReduce interfaces are retained to avoid incompatibility with existing application implementations. New interfaces are added by overloading methods. HDFS is still used without modification.

Transparency. Users do not need to know how their incremental data are processed. All state data, including their storage paths, are transparent to users.

Reduced resource usage. Computing resource requests and allocations are determined by the historical state information and the size of the newly added data. Dynamic scheduling decisions help minimize the overhead.

Several important modules are added in the IncMR framework. The input manager finds newly added data automatically; according to the execution delay configuration or an input size threshold, it determines whether a new data processing run should be started. The job scheduler, unlike a traditional job scheduler that determines the number of tasks to run only according to a configuration file defined in advance, takes the state into consideration when choosing nodes for reduce tasks. A state manager stores all the information needed for incremental jobs and provides decision support for the job scheduler. The output manager is responsible for the storage and update of all outputs.

C. Locality control and optimization
The job scheduler is responsible for creating map tasks and reduce tasks. The main overhead of IncMR lies in the storage and transmission of many intermediate results. In the shuffle phase, the framework fetches the relevant partition of the output of all the mappers to each reduce task node via HTTP. The shuffle and sort phases occur simultaneously: while map outputs are being fetched, they are merged. Locality control is always used to optimize job and task scheduling; the sketch below shows the default rule that decides which reduce task owns a key.
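The partitioning rule below reproduces the behavior of Hadoop's default HashPartitioner. It is included because it explains the locality problem discussed next: a key is routed to reduce task hash(key) mod numReduceTasks, so any change in the number of reduce tasks reassigns keys to different reducers and invalidates per-node cached state.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/** Same rule as Hadoop's default HashPartitioner: the reduce task that owns
 *  a key is a pure function of the key and numReduceTasks. If numReduceTasks
 *  changes between runs, keys move to different reducers, which is why cached
 *  reduce-side state must then be refetched and repartitioned. */
public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the modulus is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```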
Naive locality control. When there is only one reduce task and it is always located on the same node, the reduce task can fetch cached state directly from the local node, without needing to fetch the prior map outputs from other nodes.
Basically, the data flow is simplest to control when there is only one reduce task.

Complex locality control. If the number of reduce tasks is greater than one and does not change, the job scheduler tries to allocate reduce tasks to the nodes that have performed reduce tasks recently, because those nodes have cached the related state. What is more, the previously fetched results are already sorted. This locality control saves a great deal of data transfer time. If the number of reduce tasks or the position of a reduce task changes, state fetching and data repartitioning operations cannot be avoided.

Iterative MapReduce. Iterative computing widely exists in data mining, text processing, graph processing, and other data-intensive and computation-intensive applications. The MapReduce programming paradigm was designed initially for single-step execution, but its high efficiency and simplicity have attracted much enthusiasm for applying it to iterative environments and algorithms; the naive job-chaining pattern these runtimes improve on is sketched below. HaLoop, a runtime based on Hadoop, supports iterative data analysis applications, especially over large-scale data; by caching the related invariant data and the reducers' local outputs for a job, it can execute recursively. Twister is a lightweight MapReduce runtime; it uses a publish/subscribe messaging infrastructure for communication and supports iterative task execution. To provide solutions for applications such as data mining or social network analysis over relational data, iMapReduce supports automatic processing of iterative tasks by reusing the prior task processors and eliminating the shuffle load of the static data. Additionally, iterative computing is an indispensable part of incremental processing, and we apply related iterative computing methods in our IncMR framework.
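To show what these iterative runtimes optimize, here is a minimal driver for the naive pattern on plain Hadoop: one MapReduce job per iteration, with each round's output feeding the next round's input. The identity Mapper and Reducer classes stand in for a real algorithm's per-iteration logic (e.g., a PageRank update), and the path naming is illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Naive iteration on plain MapReduce: one job per round. Every round
 *  re-reads, re-shuffles, and re-writes all data, including loop-invariant
 *  structure data -- the overhead HaLoop, Twister, and iMapReduce attack. */
public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);              // initial state, e.g. ranks
        int iterations = Integer.parseInt(args[1]);

        for (int i = 1; i <= iterations; i++) {
            Path output = new Path(args[0] + "_iter" + i);  // illustrative naming
            Job job = Job.getInstance(conf, "iteration-" + i);
            job.setJarByClass(IterativeDriver.class);
            // Identity map/reduce as placeholders for the real update step.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);
            job.setOutputKeyClass(LongWritable.class); // TextInputFormat key type
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);
            if (!job.waitForCompletion(true)) System.exit(1);
            input = output;  // this round's output is next round's input
        }
    }
}
```

Runtimes such as HaLoop and iMapReduce keep the loop-invariant data cached across rounds instead of pushing it through the full read-shuffle-write cycle every iteration, which is where their speedups come from.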
Continuous MapReduce. Ad-hoc data processing is a critical paradigm for wide-scale data processing, especially for unstructured data. One line of work implements an ad-hoc data processing abstraction in a distributed stream processor based on the MapReduce programming model to support continuous inputs. MapReduce Online adopts a pipelining technique within a job and between jobs, supports single-job and multi-job online aggregation, and also provides database-style continuous queries over data streams. Another approach combines the MapReduce programming model with a continuous query model characterized by Cut-Rewind to process dynamic stream data in chunks. CMR, continuous MapReduce, is an architecture for continuous and large-scale data analysis. Continuous processing is a special case of incremental processing: although the input is continuous, the processing is discrete, so the time interval or delay is an important factor to consider. The IncMR framework presented in this paper supports continuous processing.

Applications based on MapReduce. When programming in the MapReduce framework, all we need to do is prepare the input data and implement the mapper, the reducer, and, optionally, the combiner and the partitioner. Execution is handled transparently by the framework on clusters ranging from a single node to thousands of nodes, with datasets ranging from gigabytes to petabytes and with different data structures, including relational databases, text, graphs, video, and audio. Through a simple interface with two functions, map and reduce, the MapReduce model facilitates parallel implementation of many real-world tasks, such as data processing for search engines and machine learning. MapReduce is now popular in scientific computing fields too. When new input arrives, it is a challenge for applications to deal with it continuously on top of the original MapReduce; this paper addresses the problem in general.
3. OUR WORK
We propose i2MapReduce, an extension to MapReduce that supports fine-grain incremental processing for both one-step and iterative computation. Compared to previous solutions, i2MapReduce incorporates the following three novel features.

Fine-grain incremental processing using MRBG-Store. Unlike Incoop, i2MapReduce supports kv-pair level fine-grain incremental processing in order to minimize the amount of re-computation as much as possible. We model the kv-pair level data flow and data dependence in a MapReduce computation as a bipartite graph, called MRBGraph. An MRBG-Store is designed to preserve the fine-grain states in the MRBGraph and to support efficient queries that retrieve fine-grain states for incremental processing; a toy in-memory sketch of this idea follows at the end of this section.

General-purpose iterative computation. With a modest extension to the MapReduce API, previous work proposed iMapReduce to efficiently support iterative computation on the MapReduce platform. It targets types of iterative computation in which there is a one-to-one/all-to-one correspondence from Reduce output to Map input. In comparison, our current method provides general-purpose support, including not only one-to-one but also one-to-many, many-to-one, and many-to-many correspondences. We enhance the Map API to allow users to easily express loop-invariant structure data, and we propose a Project API function to express the correspondence from Reduce to Map. While users need to slightly change their algorithms in order to take full advantage of i2MapReduce, such modification is modest compared to the effort of re-implementing algorithms in a completely different programming paradigm.

Incremental processing for iterative computation. Incremental iterative processing is substantially more challenging than incremental one-step processing, because even a small number of updates may propagate to affect a large portion of intermediate states after a number of iterations. To address this problem, we propose to reuse the converged state from the previous computation and employ a change propagation control (CPC) mechanism. We also enhance the MRBG-Store to better support the access patterns of incremental iterative processing. To our knowledge, i2MapReduce is the first MapReduce-based solution that efficiently supports incremental iterative computation.
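The following toy, in-memory sketch conveys the MRBGraph idea in a few lines. It is our own illustration under stated assumptions, not the paper's MRBG-Store, which is a disk-based structure with optimized queries; all class and method names here are invented. Each "edge" records which map input produced which intermediate value for which reduce key, so when a delta arrives only the affected reduce keys are re-reduced.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy MRBGraph-style state store (illustrative only; names are ours).
 *  edges: reduce key -> (map input id -> the value that input contributed).
 *  On a changed input, old contributions are retracted, new ones installed,
 *  and only the touched reduce keys are re-reduced. */
public class MrbgStoreSketch {
    private final Map<String, Map<String, Integer>> edges = new HashMap<>();

    /** Re-run map() for one changed input and return the dirty reduce keys. */
    public Set<String> applyDelta(String inputId, Map<String, Integer> newKvs) {
        Set<String> dirty = new HashSet<>();
        for (Map.Entry<String, Map<String, Integer>> e : edges.entrySet()) {
            if (e.getValue().remove(inputId) != null) {
                dirty.add(e.getKey());         // retracted an old contribution
            }
        }
        for (Map.Entry<String, Integer> kv : newKvs.entrySet()) {
            edges.computeIfAbsent(kv.getKey(), k -> new HashMap<>())
                 .put(inputId, kv.getValue()); // installed a new contribution
            dirty.add(kv.getKey());
        }
        return dirty;
    }

    /** Re-reduce only the dirty keys (here: a simple sum reducer). */
    public Map<String, Integer> reduceDirty(Set<String> dirty) {
        Map<String, Integer> out = new HashMap<>();
        for (String key : dirty) {
            int sum = 0;
            for (int v : edges.getOrDefault(key, Collections.emptyMap()).values()) {
                sum += v;
            }
            out.put(key, sum);
        }
        return out;
    }
}
```

The contrast with Incoop is visible in applyDelta: a changed input dirties only the reduce keys it touches, whereas task-level re-computation would rerun every kv-pair in the affected task.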
4. TECHNIQUES
The techniques employed in i2MapReduce include:
- the query algorithm in the MRBG-Store;
- PageRank in MapReduce;
- K-means in MapReduce;
- GIM-V (Generalized Iterative Matrix-Vector multiplication) in MapReduce.

5. CONCLUSIONS
We have described i2MapReduce, a MapReduce-based framework for incremental big data processing. i2MapReduce combines a fine-grain incremental engine, a general-purpose iterative model, and a set of effective techniques for incremental iterative computation. Real-machine experiments show that i2MapReduce can significantly reduce the running time of refreshing big data mining results compared to re-computation on both plain and iterative MapReduce. In future work, we will further evaluate the framework using a one-step algorithm and four iterative algorithms with diverse computation characteristics; experimental results already show significant performance improvements for i2MapReduce compared to MapReduce performing re-computation.

REFERENCES
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proc. 6th Conf. Symp. Oper. Syst. Des. Implementation, 2004, p. 10.
[2] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proc. 9th USENIX Conf. Netw. Syst. Des. Implementation, 2012, p. 2.
[3] R. Power and J. Li, "Piccolo: Building fast, distributed programs with partitioned tables," in Proc. 9th USENIX Conf. Oper. Syst. Des. Implementation, 2010, pp. 1-14.
[4] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, "Pregel: A system for large-scale graph processing," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2010, pp. 135-146.
[5] S. R. Mihaylov, Z. G. Ives, and S. Guha, "Rex: Recursive, delta-based data-centric computation," in Proc. VLDB Endowment, 2012, vol. 5, no. 11, pp. 1280-1291.
[6] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, "Distributed GraphLab: A framework for machine learning and data mining in the cloud," in Proc. VLDB Endowment, 2012, vol. 5, no. 8, pp. 716-727.
[7] S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl, "Spinning fast iterative data flows," in Proc. VLDB Endowment, 2012, vol. 5, no. 11, pp. 1268-1279.
[8] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, "HaLoop: Efficient iterative data processing on large clusters," in Proc. VLDB Endowment, 2010, vol. 3, no. 1-2, pp. 285-296.
[9] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox, "Twister: A runtime for iterative MapReduce," in Proc. 19th ACM Symp. High Performance Distributed Comput., 2010, pp. 810-818.
[10] Y. Zhang, Q. Gao, L. Gao, and C. Wang, "iMapReduce: A distributed computing framework for iterative computation," J. Grid Comput., vol. 10, no. 1, pp. 47-68, 2012.