FINE-GRAIN INCREMENTAL PROCESSING OF MAPREDUCE AND MINING IN BIG DATA ENVIRONMENT


S. SURESH KUMAR, Jay Shriram Group of Institutions, Tirupur
Mr. A. M. RAVISHANKKAR M.E., Assistant Professor, Jay Shriram Group of Institutions, Tirupur
Dr. S. RAJALAKSHMI Ph.D., Associate Professor, Jay Shriram Group of Institutions, Tirupur

ABSTRACT

As new data and updates continuously arrive, the results of data mining applications become stale and obsolete over time. Incremental processing is a promising approach to refreshing mining results: it utilizes previously saved states to avoid the expense of re-computation from scratch. We propose i2MapReduce, a novel incremental processing extension to MapReduce, the most widely used framework for mining big data. Compared with the state-of-the-art work on Incoop, i2MapReduce performs key-value pair level incremental processing rather than task-level re-computation, supports not only one-step computation but also more sophisticated iterative computation, which is widely used in data mining applications, and incorporates a set of novel techniques to reduce I/O overhead for accessing preserved fine-grain computation states. We evaluate i2MapReduce using a one-step algorithm and four iterative algorithms with diverse computation characteristics.

Keywords: Big Data, MapReduce, Incremental Processing, Data Mining.

1. INTRODUCTION

The MapReduce framework offers techniques for convenient, distributed processing of data through a simple programming model that relieves programmers of the burden of implementing complex logic or infrastructure for parallelization, data transfer, fault tolerance, and scheduling. An important property of the workloads processed by MapReduce applications is that they are often incremental by nature; i.e., MapReduce jobs often run repeatedly with small changes in their input.
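To make the programming model concrete, here is a minimal in-memory sketch of the map/shuffle/reduce contract (an illustrative toy, not the Hadoop API):

```python
from collections import defaultdict

def map_fn(_, line):
    # Emit (word, 1) for every word in one input record.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Aggregate all values that share a key.
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)            # shuffle: group map output by key
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    result = {}
    for k, vs in groups.items():
        for out_k, out_v in reduce_fn(k, vs):
            result[out_k] = out_v
    return result

counts = run_mapreduce(enumerate(["a b a", "b c"]), map_fn, reduce_fn)
```

The framework owns grouping, scheduling, and fault tolerance; the programmer supplies only the two functions. This division of labor is precisely what makes repeatedly re-running a job over slightly changed input wasteful without any state reuse.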
For example, search engines will periodically crawl the Web and perform various computations on this input, such as computing a Web index or the PageRank metric, often with very small modifications. This growing nature of data suggests that performing large-scale computations incrementally can improve efficiency dramatically. Broadly speaking, there are two methods to achieve such efficient incremental updates. The first method is to devise systems that provide the programmer with facilities to save and use state across successive runs, so that only computations affected by changes to the input need to be executed. This is precisely the strategy taken by major Internet companies, who developed systems like Percolator or CBP. However, this method requires adopting a new programming model and a new Application Programming Interface (API) that differs from the one used by MapReduce. These new APIs also require the coder to devise a way to process updates efficiently, which can increase

algorithmic and software complexity. Research in the algorithms community on dynamic algorithms shows that such algorithms can be very complex even for problems that are relatively straightforward in the non-incremental case. The second method is to develop systems that can reuse the results of prior computation transparently. This method shifts the complexity of incremental processing from the programmer to the processing system, essentially keeping the spirit of high-level models such as MapReduce. A few methods have taken this approach, e.g., DryadInc and Nectar, in the context of the Dryad system, by providing techniques for task-level or LINQ expression-level memoization.

Fig. 1 Architecture diagram

The design of Incoop contains the following new methods, which we incorporated into the Hadoop MapReduce framework and which proved instrumental in achieving efficient incremental computations.

Incremental HDFS. Instead of relying on HDFS to store the input to MapReduce jobs, we devise a file system called Inc-HDFS that provides mechanisms to identify similarities in the input data of consecutive job runs. The idea is to divide the input into chunks whose boundaries depend on the file contents, so that small changes to the input don't change all chunk boundaries. Inc-HDFS therefore partitions the input in a way that maximizes the opportunities for reusing results from previous computations, while preserving compatibility with HDFS by providing the same interface and semantics.

Contraction phase. We introduce methods for controlling the granularity of tasks so that large tasks can be divided into smaller subtasks that can be reused even when the large tasks cannot. This is particularly challenging in Reduce tasks, whose granularity depends solely on their input.
Our solution is to introduce a new Contraction phase that leverages Combiner functions, normally used to reduce network traffic by anticipating a small part of the processing done by Reducer tasks, to control the granularity of the Reduce tasks.

Memoization-aware scheduler. To increase the effectiveness of memoization, we suggest an affinity-based scheduler that uses a work-stealing algorithm to reduce the amount of data movement across machines. Our new scheduler strikes a balance between exploiting the locality of previously computed results and executing tasks on any available machine to prevent straggling effects.

Use cases. We employ Incoop to demonstrate two important use cases of incremental processing: incremental log processing, where we use Incoop to build a framework to incrementally process logs as more entries are added to them; and incremental query processing, where we layer the Pig framework on top of Incoop to enable relational query processing on continuously arriving data.
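The content-defined chunking idea behind Inc-HDFS can be illustrated with a toy rolling-hash sketch (the function name, window size, and mask below are illustrative assumptions, not Inc-HDFS's actual implementation):

```python
def chunk_boundaries(data: bytes, window: int = 4, mask: int = 0x1F) -> list[int]:
    """Place a boundary wherever a hash of the last `window` bytes
    matches a fixed bit pattern, so boundaries depend on content
    rather than on absolute file offsets."""
    boundaries = []
    for i in range(window, len(data)):
        h = 0
        for b in data[i - window:i]:      # naive windowed hash, for clarity
            h = (h * 31 + b) & 0xFFFFFFFF
        if h & mask == 0:
            boundaries.append(i)
    return boundaries

# Prepending 3 bytes shifts every offset, yet all boundaries past the
# edit reappear (shifted by 3), because they are derived from content.
text = b"the quick brown fox jumps over the lazy dog" * 4
a = chunk_boundaries(text)
b = chunk_boundaries(b"XXX" + text)
```

With fixed-size chunks, the same 3-byte insertion would shift every chunk boundary and defeat reuse; here only the chunks overlapping the edit change, which is what maximizes reuse across consecutive job runs.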

When we design and implement an incremental computation framework, many factors, including algorithm accuracy, run time, and space overhead, need to be taken into consideration. This paper focuses on porting parallel algorithms based on the MapReduce model and on the compatibility of non-incremental and incremental processing. A parallel programming framework is presented which aims to be compatible with the original MapReduce APIs, so that programmers do not need to rewrite their algorithms. It makes the following contributions:

Incremental MapReduce framework. It supports incremental data input, incremental data processing, intermediate state preservation, and incremental map and reduce functions. The input manager can dynamically discover new inputs and then submit jobs to the master node.

Dynamic resource allocation based on the state. The state provides an important reference for resource request and allocation in the next execution. State information includes prior processing results, intermediate results, execution time, and the number of reduce tasks. Input data, acting as the observation, will change the current state into a new state after the job finishes.

Friendly APIs for applications. The submitjob() method in class JobClient is overloaded. Users only need to follow the method parameters to submit their jobs, without modifying their algorithms or application programs. Furthermore, for continuous inputs, users can get updated outputs in time.

2. RELATED WORKS

The PageRank algorithm computes ranking scores of web pages based on the web graph structure to support web search. The web graph structure is constantly evolving: web pages and hyperlinks are created, deleted, and updated. As the underlying web graph evolves, the PageRank results gradually become stale, potentially lowering the quality of web search. It is therefore desirable to refresh the PageRank computation regularly. Incremental processing is a promising approach to refreshing mining results.
Given the size of the input big data, it is often very expensive to rerun the entire computation from scratch. Incremental processing exploits the fact that the input data of two subsequent computations A and B are similar: only a very small fraction of the input data has changed. The idea is to save states in computation A, reuse A's states in computation B, and perform re-computation only for states that are affected by the changed input data. A number of existing studies have followed this principle and designed new programming models to support incremental processing. These new programming models are drastically different from MapReduce, requiring programmers to completely re-implement their algorithms. Incoop extends MapReduce to support incremental processing, but it has two main limitations. First, Incoop supports only task-level incremental processing. It saves and reuses states at the granularity of individual Map and Reduce tasks. Each task typically processes a large number of key-value pairs. If Incoop detects any data changes in the input of a task, it will rerun the entire task. While this approach easily leverages existing MapReduce features for state saving, it may incur a large amount of redundant computation if only a small fraction of kv-pairs have changed in a task. Second, Incoop supports only one-step computation, while important mining algorithms, such as PageRank, require iterative computation. Incoop would treat each iteration as a separate MapReduce job. A small number of input data changes may gradually propagate to affect a large portion of

intermediate states after a number of iterations, resulting in expensive global re-computation afterwards. In vertex-centric graph systems, the computation is broken down into a sequence of supersteps. In each superstep, a Compute function is invoked on each vertex; it communicates with other vertices by sending and receiving messages and performs computation for the current vertex. This method can efficiently support a large number of iterative graph algorithms. Another system provides a group-wise processing operator, Translate, that takes state as an explicit input to support incremental analysis, but it adopts a new programming model that is very different from MapReduce. Several research studies support incremental processing by task-level re-computation, but they require users to manipulate the states on their own. In contrast, i2MapReduce exploits fine-grain kv-pair level re-computation, which is more advantageous. For incremental processing of iterative applications, a timely dataflow paradigm has been proposed that allows stateful computation and arbitrary nested iterations, but to support incremental iterative computation, programmers have to completely rewrite their MapReduce programs. We instead extend the widely used MapReduce model for incremental iterative computation: previous MapReduce programs can be slightly changed to run on i2MapReduce for incremental processing.

A. Limitations of MapReduce

The MapReduce programming model is popular due to its simplicity and high efficiency. In the MapReduce framework, when a job is submitted, the related input is partitioned into fixed-size pieces called blocks, which are located on different data nodes. A job creates multiple splits according to the number of blocks, and each split is processed independently as the input of a separate map task. A map task in a worker node runs the user-defined map function for each record in the split and writes its output to the local disk. When there are multiple reducers, the map task partitions its output, with one partition per reduce task.
Reduce tasks write their outputs to the distributed file system. Despite its powerful automatic parallelization with strong fault tolerance, the original MapReduce exhibits the following limitations.

Stateless. When map and reduce tasks finish, they write their outputs to a local disk or a distributed file system and then inform the scheduler. When a job completes, related intermediate outputs are deleted by a cleanup mechanism. When new data or input arrives, a new job needs to be created to process it. This is just like HTTP, a stateless protocol, which provides no means of storing a user's data between requests. To some extent, we can say that the MapReduce model is likewise stateless.

Stage independent. A MapReduce job can be divided into two stages: the map stage and the reduce stage. Neither stage interrupts the other's execution. In the map stage, each computing thread executes the map method on the input split allocated to it and writes the output to the local node. In the reduce stage, each reduce thread fetches input from designated nodes, executes the reduce method, and writes the output to the specified file system. All map tasks and reduce tasks execute their code without disturbing each other.

Single-step. Map tasks and reduce tasks execute only once, in order, for a job. Map tasks may finish at different times, and reduce tasks start copying the map outputs as soon as all map tasks complete successfully.

B. Extension of MapReduce

Because of the limitations listed above, the MapReduce model has to be extended for incremental

computation. Existing extensions feature different input patterns, separate or coupled control between state and dataflow, and compatible or newly added interfaces. Batch parallel processing refers to systems that provide highly efficient large-scale parallel computation with one batch input and one output, such as Google MapReduce, Hadoop, and Dryad/DryadLINQ. Incremental algorithms, more complicated than the original algorithms, can improve the runtime by modifying the algorithms; their input includes the newly added data and the latest running result. Continuous bulk processing is another kind of solution that supports incremental processing by providing new primitives for developers to design delicate dataflow-oriented applications. It takes the secondary results of prior executions as part of the explicit input. CBP of Yahoo and Percolator of Google both provide such incremental computation frameworks. Incremental computation based on MapReduce, making full use of the MapReduce programming model, supports incremental processing by modifying the kernel implementation of the map and reduce stages. Because HDFS does not currently support appends, some approaches also modify the distributed file system to support incremental data discovery and intermediate result storage. IncMR is an improved method for large-scale incremental data processing. The framework inherits the simplicity of the original MapReduce model: it doesn't modify HDFS and still uses the same MapReduce APIs, so all algorithms or programs can complete incremental data processing without any modification.

Compatibility. Original MapReduce interfaces are retained to avoid incompatibility with existing implementations of applications. New interfaces are added by overloading methods. HDFS is still used without modification.

Transparency. Users don't need to know how their incremental data are processed. All state data, including their storage paths, are transparent to users.

Reduced resource usage.
Computing resource request and allocation are determined by the historical state information and the size of the currently added data. Dynamic scheduling decisions are useful for minimizing this overhead. Several important modules are added in the IncMR framework. The input manager finds newly added data automatically; according to the execution delay configuration or input size threshold, it determines whether a new data processing run should be started. The job scheduler, unlike a traditional job scheduler that determines the number of tasks to run only from a configuration file defined in advance, takes the state into consideration when choosing nodes for reduce tasks. A state manager stores all information needed for incremental jobs and provides decision support for the job scheduler. The output manager is responsible for the storage and update of all outputs.

C. Locality control and optimization

The job scheduler is responsible for creating map tasks and reduce tasks. The main overhead of IncMR lies in the storage and transmission of many intermediate results. In the shuffle phase, the framework fetches the relevant partition of the output of all the mappers to each reduce task node via HTTP. The shuffle and sort phases occur simultaneously: while map outputs are being fetched, they are merged. Locality control is always used for optimization of job and task scheduling.

Naive locality control. When there is only one reduce task and it is always located on the same node, the reduce task can fetch cached state from the local node directly, without needing to fetch the prior map outputs

from other nodes. Basically, the data flow is simplest to control when there is only one reduce task.

Complex locality control. If the number of reduce tasks is more than one and does not change, the job scheduler will try to allocate reduce tasks to the nodes that have performed reduce tasks recently, because they have cached the related state. What's more, the fetched prior results are also sorted. This locality control saves plenty of time for data transfer. If the number or position of reduce tasks changes, state fetching and data repartition operations can't be avoided.

Iterative MapReduce. Iterative computing widely exists in data mining, text processing, graph processing, and other data-intensive and computing-intensive applications. The MapReduce programming paradigm was designed initially for single-step execution; its high efficiency and simplicity have attracted much enthusiasm for applying it to iterative environments and algorithms. HaLoop, a runtime based on Hadoop, supports iterative data analysis applications, especially for large-scale data. By caching the related invariant data and the reducers' local outputs for one job, it can execute recursively. Twister is a lightweight MapReduce runtime; it uses a publish/subscribe messaging infrastructure for communication and supports iterative task execution. To provide solutions for applications such as data mining or social network analysis with relational data, iMapReduce is designed to support processing iterative tasks automatically by reusing the prior task processors and eliminating the shuffle load of the static data. Additionally, iterative computing is an indispensable part of incremental processing; we also apply related iterative computing methods in our IncMR framework.

Continuous MapReduce. Ad-hoc data processing is a critical paradigm for wide-scale data processing, especially for unstructured data. One reference implements an ad-hoc data processing abstraction in a distributed stream processor based on the MapReduce programming model to support continuous inputs. MapReduce Online adopts a pipelining technique within a job and between jobs, supports single-job and multi-job online aggregation, and also provides database continuous queries over data streams. Another reference combines the MapReduce programming model with the continuous query model characterized by Cut-Rewind to process dynamic stream data chunks. CMR, continuous MapReduce, is an architecture for continuous and large-scale data analysis. Continuous processing is a special case of incremental processing: although the input is continuous, the processing is discrete, and the time interval or delay is an important factor to consider. IncMR, presented in this paper, supports continuous processing.

Applications based on MapReduce. When programming in the MapReduce framework, all we need to do is prepare the input data and implement the mapper, the reducer, and, optionally, the combiner and the partitioner. The execution is handled transparently by the framework on clusters ranging from a single node to thousands of nodes, with data sets ranging from gigabytes to petabytes and data structures including relational databases, text, graphs, video, and audio. Through a simple interface with two functions, map and reduce, the MapReduce model facilitates parallel implementation of many real-world tasks, such as data processing for search engines and machine learning. MapReduce is now popularly used in scientific computing fields too. When new input arrives, it is a challenge for applications to continuously deal with it based on the original MapReduce. This paper addresses this problem in general.
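The iterative MapReduce pattern behind systems like HaLoop and iMapReduce reduces to a driver loop that reapplies one job-like step until convergence. The following is a minimal illustrative sketch (toy in-memory code, not any of those systems' actual APIs), using a PageRank-style step as the iterated computation:

```python
def iterate_mapreduce(state, step, converged, max_iters=200):
    """Driver loop: re-run one MapReduce-style step on the previous
    output until successive states are close enough."""
    for i in range(max_iters):
        new_state = step(state)
        if converged(state, new_state):
            return new_state, i + 1
        state = new_state
    return state, max_iters

# Toy iterated computation: one PageRank-style step on a 3-node graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

def pagerank_step(ranks, d=0.85):
    # Conceptually: map emits rank/out-degree per outgoing link, the
    # shuffle groups by target page, and reduce sums contributions.
    out = {v: (1 - d) / len(ranks) for v in ranks}
    for src, dests in links.items():
        for dst in dests:
            out[dst] += d * ranks[src] / len(dests)
    return out

ranks, iters = iterate_mapreduce(
    {v: 1 / 3 for v in links},
    pagerank_step,
    lambda old, new: max(abs(old[v] - new[v]) for v in old) < 1e-6,
)
```

Without state reuse, even a tiny change to `links` forces the entire loop to re-run from scratch, and each superstep is a fresh job. That repeated, mostly redundant work is exactly what incremental iterative processing targets.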

3. OUR WORK

We propose i2MapReduce, an extension to MapReduce that supports fine-grain incremental processing for both one-step and iterative computation. Compared to previous solutions, i2MapReduce incorporates the following three novel features:

Fine-grain incremental processing using MRBG-Store. Unlike Incoop, i2MapReduce supports kv-pair level fine-grain incremental processing in order to minimize the amount of re-computation as much as possible. We model the kv-pair level data flow and data dependence in a MapReduce computation as a bipartite graph, called MRBGraph. An MRBG-Store is designed to preserve the fine-grain states in the MRBGraph and to support efficient queries that retrieve fine-grain states for incremental processing.

General-purpose iterative computation. With a modest extension to the MapReduce API, previous work proposed iMapReduce to efficiently support iterative computation on the MapReduce platform. It targets types of iterative computation where there is a one-to-one/all-to-one correspondence from Reduce output to Map input. In comparison, our current method provides general-purpose support, including not only one-to-one but also one-to-many, many-to-one, and many-to-many correspondences. We enhance the Map API to allow users to easily express loop-invariant structure data, and we propose a Project API function to express the correspondence from Reduce to Map. While users need to slightly change their algorithms in order to take full advantage of i2MapReduce, such modification is modest compared to the effort of re-implementing algorithms on a completely different programming paradigm.

Incremental processing for iterative computation. Incremental iterative processing is substantially more challenging than incremental one-step processing, because even a small number of updates may propagate to affect a large portion of intermediate states after a number of iterations.
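The kv-pair level reuse behind the MRBGraph can be sketched with a toy in-memory model (an illustrative sketch under our own naming, not the actual MRBG-Store): preserve each input record's map output, re-invoke the map function only on changed records, and re-reduce only the keys those changes touch.

```python
def incremental_run(map_fn, reduce_fn, old_input, new_input, saved):
    """`saved` maps each input record id to its preserved map output
    (a list of (key, value) pairs). Only changed, added, or deleted
    records are re-mapped, and only the keys they touch are re-reduced."""
    affected = set()
    for rid, record in new_input.items():
        if old_input.get(rid) != record:                 # new or changed record
            old_pairs = saved.get(rid, [])
            new_pairs = list(map_fn(rid, record))
            affected |= {k for k, _ in old_pairs} | {k for k, _ in new_pairs}
            saved[rid] = new_pairs
    for rid in set(old_input) - set(new_input):          # deleted records
        affected |= {k for k, _ in saved.pop(rid, [])}
    # Re-reduce only the affected keys over the preserved map output.
    results = {}
    for key in affected:
        values = [v for pairs in saved.values() for k, v in pairs if k == key]
        if values:
            results[key] = reduce_fn(key, values)
    return results

def wc_map(_, line):
    return [(w, 1) for w in line.split()]

def wc_reduce(_, values):
    return sum(values)

saved = {}
full = incremental_run(wc_map, wc_reduce, {}, {0: "a b", 1: "b c"}, saved)
delta = incremental_run(wc_map, wc_reduce,
                        {0: "a b", 1: "b c"}, {0: "a b", 1: "b b"}, saved)
```

Here the second call re-maps only record 1 and re-reduces only the keys "b" and "c"; record 0's preserved state is reused untouched. This is the contrast with task-level schemes like Incoop, which would rerun the whole task containing record 1.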
To address this problem, we propose to reuse the converged state from the previous computation and to employ a change propagation control (CPC) mechanism. We also enhance the MRBG-Store to better support the access patterns in incremental iterative processing. i2MapReduce is the first MapReduce-based solution that efficiently supports incremental iterative computation.

TECHNIQUES

i2MapReduce
Query algorithm in MRBG-Store
PageRank in MapReduce
K-means in MapReduce
GIM-V (Generalized Iterated Matrix-Vector multiplication) in MapReduce

CONCLUSIONS

We have described i2MapReduce, a MapReduce-based framework for incremental big data processing. i2MapReduce combines a fine-grain incremental engine, a general-purpose iterative model, and a set of effective techniques for incremental iterative computation. Real-machine experiments show that i2MapReduce can significantly reduce the run time for refreshing big data mining results compared to re-computation on both plain and iterative MapReduce. In future work, we will further evaluate the framework using a one-step algorithm and four iterative algorithms with diverse computation characteristics; experimental results so far show significant performance improvements for i2MapReduce compared to MapReduce re-computation.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proc. 6th Conf. Symp. Oper. Syst. Des. Implementation, 2004, p. 10.
[2] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proc. 9th USENIX Conf. Netw. Syst. Des. Implementation, 2012, p. 2.
[3] R. Power and J. Li, "Piccolo: Building fast, distributed programs with partitioned tables," in Proc. 9th USENIX Conf. Oper. Syst. Des. Implementation, 2010.
[4] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, "Pregel: A system for large-scale graph processing," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2010.
[5] S. R. Mihaylov, Z. G. Ives, and S. Guha, "REX: Recursive, delta-based data-centric computation," Proc. VLDB Endowment, vol. 5, no. 11, 2012.
[6] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, "Distributed GraphLab: A framework for machine learning and data mining in the cloud," Proc. VLDB Endowment, vol. 5, no. 8, 2012.
[7] S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl, "Spinning fast iterative data flows," Proc. VLDB Endowment, vol. 5, no. 11, 2012.
[8] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, "HaLoop: Efficient iterative data processing on large clusters," Proc. VLDB Endowment, vol. 3, no. 1-2, 2010.
[9] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox, "Twister: A runtime for iterative MapReduce," in Proc. 19th ACM Symp. High Performance Distributed Comput., 2010.
[10] Y. Zhang, Q. Gao, L. Gao, and C. Wang, "iMapReduce: A distributed computing framework for iterative computation," J. Grid Comput., vol. 10, no. 1.


A NOVEL JOB SCHEDULING ALGORITHM TO ENHANCE EXECUTION AND RESOURCE MANAGEMENT USING MAP REDUCE A NOVEL JOB SCHEDULING ALGORITHM TO ENHANCE EXECUTION AND RESOURCE MANAGEMENT USING MAP REDUCE Ms. P.PRIYANKA 1, Mr. A. JAHIR HUSAIN 2 1 M.Tech., Dept. of Computer Science & Engineering, PRIST University,Vallam

More information

Authors: Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, L., Leiser, N., Czjkowski, G.

Authors: Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, L., Leiser, N., Czjkowski, G. Authors: Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, L., Leiser, N., Czjkowski, G. Speaker: Chong Li Department: Applied Health Science Program: Master of Health Informatics 1 Term

More information

Spark: A Brief History. https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf

Spark: A Brief History. https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf Spark: A Brief History https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf A Brief History: 2004 MapReduce paper 2010 Spark paper 2002 2004 2006 2008 2010 2012 2014 2002 MapReduce @ Google

More information

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic

More information

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin,

More information

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

More information

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Research challenges in data-intensive computing The Stratosphere Project Apache Flink Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive

More information

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications Spark In- Memory Cluster Computing for Iterative and Interactive Applications Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker,

More information

MapReduce Simplified Data Processing on Large Clusters

MapReduce Simplified Data Processing on Large Clusters MapReduce Simplified Data Processing on Large Clusters Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) MapReduce 1393/8/5 1 /

More information

Uday Kumar Sr 1, Naveen D Chandavarkar 2 1 PG Scholar, Assistant professor, Dept. of CSE, NMAMIT, Nitte, India. IJRASET 2015: All Rights are Reserved

Uday Kumar Sr 1, Naveen D Chandavarkar 2 1 PG Scholar, Assistant professor, Dept. of CSE, NMAMIT, Nitte, India. IJRASET 2015: All Rights are Reserved Implementation of K-Means Clustering Algorithm in Hadoop Framework Uday Kumar Sr 1, Naveen D Chandavarkar 2 1 PG Scholar, Assistant professor, Dept. of CSE, NMAMIT, Nitte, India Abstract Drastic growth

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models CIEL: A Universal Execution Engine for

More information

Massive Online Analysis - Storm,Spark

Massive Online Analysis - Storm,Spark Massive Online Analysis - Storm,Spark presentation by R. Kishore Kumar Research Scholar Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Kharagpur-721302, India (R

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Today s content. Resilient Distributed Datasets(RDDs) Spark and its data model

Today s content. Resilient Distributed Datasets(RDDs) Spark and its data model Today s content Resilient Distributed Datasets(RDDs) ------ Spark and its data model Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing -- Spark By Matei Zaharia,

More information

HaLoop Efficient Iterative Data Processing on Large Clusters

HaLoop Efficient Iterative Data Processing on Large Clusters HaLoop Efficient Iterative Data Processing on Large Clusters Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst University of Washington Department of Computer Science & Engineering Presented

More information

Survey on MapReduce Scheduling Algorithms

Survey on MapReduce Scheduling Algorithms Survey on MapReduce Scheduling Algorithms Liya Thomas, Mtech Student, Department of CSE, SCTCE,TVM Syama R, Assistant Professor Department of CSE, SCTCE,TVM ABSTRACT MapReduce is a programming model used

More information

PREGEL AND GIRAPH. Why Pregel? Processing large graph problems is challenging Options

PREGEL AND GIRAPH. Why Pregel? Processing large graph problems is challenging Options Data Management in the Cloud PREGEL AND GIRAPH Thanks to Kristin Tufte 1 Why Pregel? Processing large graph problems is challenging Options Custom distributed infrastructure Existing distributed computing

More information

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications Spark In- Memory Cluster Computing for Iterative and Interactive Applications Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker,

More information

2/4/2019 Week 3- A Sangmi Lee Pallickara

2/4/2019 Week 3- A Sangmi Lee Pallickara Week 3-A-0 2/4/2019 Colorado State University, Spring 2019 Week 3-A-1 CS535 BIG DATA FAQs PART A. BIG DATA TECHNOLOGY 3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING SECTION 1: MAPREDUCE PA1

More information

Large-scale Incremental Data Processing with Change Propagation

Large-scale Incremental Data Processing with Change Propagation Large-scale Incremental Data Processing with Change Propagation Pramod Bhatotia Alexander Wieder İstemi Ekin Akkuş Rodrigo Rodrigues Umut A. Acar Max Planck Institute for Software Systems (MPI-SWS) Abstract

More information

Parallel Computing: MapReduce Jin, Hai

Parallel Computing: MapReduce Jin, Hai Parallel Computing: MapReduce Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology ! MapReduce is a distributed/parallel computing framework introduced by Google

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Piccolo: Building Fast, Distributed Programs

More information

MapReduce. U of Toronto, 2014

MapReduce. U of Toronto, 2014 MapReduce U of Toronto, 2014 http://www.google.org/flutrends/ca/ (2012) Average Searches Per Day: 5,134,000,000 2 Motivation Process lots of data Google processed about 24 petabytes of data per day in

More information

Improving the MapReduce Big Data Processing Framework

Improving the MapReduce Big Data Processing Framework Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP ISSN: 0976-2876 (Print) ISSN: 2250-0138 (Online) PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP T. S. NISHA a1 AND K. SATYANARAYAN REDDY b a Department of CSE, Cambridge

More information

Map Reduce. Yerevan.

Map Reduce. Yerevan. Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate

More information

Efficient, Scalable, and Provenance-Aware Management of Linked Data

Efficient, Scalable, and Provenance-Aware Management of Linked Data Efficient, Scalable, and Provenance-Aware Management of Linked Data Marcin Wylot 1 Motivation and objectives of the research The proliferation of heterogeneous Linked Data on the Web requires data management

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part

More information

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang Department of Computer Science San Marcos, TX 78666 Report Number TXSTATE-CS-TR-2010-24 Clustering in the Cloud Xuan Wang 2010-05-05 !"#$%&'()*+()+%,&+!"-#. + /+!"#$%&'()*+0"*-'(%,1$+0.23%(-)+%-+42.--3+52367&.#8&+9'21&:-';

More information

SQL-to-MapReduce Translation for Efficient OLAP Query Processing

SQL-to-MapReduce Translation for Efficient OLAP Query Processing , pp.61-70 http://dx.doi.org/10.14257/ijdta.2017.10.6.05 SQL-to-MapReduce Translation for Efficient OLAP Query Processing with MapReduce Hyeon Gyu Kim Department of Computer Engineering, Sahmyook University,

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

B490 Mining the Big Data. 5. Models for Big Data

B490 Mining the Big Data. 5. Models for Big Data B490 Mining the Big Data 5. Models for Big Data Qin Zhang 1-1 2-1 MapReduce MapReduce The MapReduce model (Dean & Ghemawat 2004) Input Output Goal Map Shuffle Reduce Standard model in industry for massive

More information

Hadoop Map Reduce 10/17/2018 1

Hadoop Map Reduce 10/17/2018 1 Hadoop Map Reduce 10/17/2018 1 MapReduce 2-in-1 A programming paradigm A query execution engine A kind of functional programming We focus on the MapReduce execution engine of Hadoop through YARN 10/17/2018

More information

Summary of Big Data Frameworks Course 2015 Professor Sasu Tarkoma

Summary of Big Data Frameworks Course 2015 Professor Sasu Tarkoma Summary of Big Data Frameworks Course 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Course Schedule Tuesday 10.3. Introduction and the Big Data Challenge Tuesday 17.3. MapReduce and Spark: Overview Tuesday

More information

Jumbo: Beyond MapReduce for Workload Balancing

Jumbo: Beyond MapReduce for Workload Balancing Jumbo: Beyond Reduce for Workload Balancing Sven Groot Supervised by Masaru Kitsuregawa Institute of Industrial Science, The University of Tokyo 4-6-1 Komaba Meguro-ku, Tokyo 153-8505, Japan sgroot@tkl.iis.u-tokyo.ac.jp

More information

L3: Spark & RDD. CDS Department of Computational and Data Sciences. Department of Computational and Data Sciences

L3: Spark & RDD. CDS Department of Computational and Data Sciences. Department of Computational and Data Sciences Indian Institute of Science Bangalore, India भ रत य व ज ञ न स स थ न ब गल र, भ रत Department of Computational and Data Sciences L3: Spark & RDD Department of Computational and Data Science, IISc, 2016 This

More information

BigData and Map Reduce VITMAC03

BigData and Map Reduce VITMAC03 BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to

More information

CS 61C: Great Ideas in Computer Architecture. MapReduce

CS 61C: Great Ideas in Computer Architecture. MapReduce CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing

More information

The HaLoop Approach to Large-Scale Iterative Data Analysis

The HaLoop Approach to Large-Scale Iterative Data Analysis Noname manuscript No. (will be inserted by the editor) The HaLoop Approach to Large-Scale Iterative Data Analysis Yingyi Bu Bill Howe Magdalena Balazinska Michael D. Ernst Received: date / Accepted: date

More information

Maiter: An Asynchronous Graph Processing Framework for Delta-based Accumulative Iterative Computation

Maiter: An Asynchronous Graph Processing Framework for Delta-based Accumulative Iterative Computation IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS 1 Maiter: An Asynchronous Graph Processing Framework for Delta-based Accumulative Iterative Computation Yanfeng Zhang, Qixin Gao, Lixin Gao, Fellow,

More information

Spark. Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.

Spark. Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica. Spark Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica UC Berkeley Background MapReduce and Dryad raised level of abstraction in cluster

More information

Map-Reduce. Marco Mura 2010 March, 31th

Map-Reduce. Marco Mura 2010 March, 31th Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

Map Reduce Group Meeting

Map Reduce Group Meeting Map Reduce Group Meeting Yasmine Badr 10/07/2014 A lot of material in this presenta0on has been adopted from the original MapReduce paper in OSDI 2004 What is Map Reduce? Programming paradigm/model for

More information

Oolong: Asynchronous Distributed Applications Made Easy

Oolong: Asynchronous Distributed Applications Made Easy Oolong: Asynchronous Distributed Applications Made Easy Christopher Mitchell Russell Power Jinyang Li New York University {cmitchell, power, jinyang}@cs.nyu.edu Abstract We present Oolong, a distributed

More information

Efficient Map Reduce Model with Hadoop Framework for Data Processing

Efficient Map Reduce Model with Hadoop Framework for Data Processing Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 4, April 2015,

More information

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Programming Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Computing Only required amount of CPU and storage can be used anytime from anywhere via network Availability, throughput, reliability

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Optimistic Recovery for Iterative Dataflows in Action

Optimistic Recovery for Iterative Dataflows in Action Optimistic Recovery for Iterative Dataflows in Action Sergey Dudoladov 1 Asterios Katsifodimos 1 Chen Xu 1 Stephan Ewen 2 Volker Markl 1 Sebastian Schelter 1 Kostas Tzoumas 2 1 Technische Universität Berlin

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

EFFICIENT ALLOCATION OF DYNAMIC RESOURCES IN A CLOUD

EFFICIENT ALLOCATION OF DYNAMIC RESOURCES IN A CLOUD EFFICIENT ALLOCATION OF DYNAMIC RESOURCES IN A CLOUD S.THIRUNAVUKKARASU 1, DR.K.P.KALIYAMURTHIE 2 Assistant Professor, Dept of IT, Bharath University, Chennai-73 1 Professor& Head, Dept of IT, Bharath

More information

Data-intensive computing systems

Data-intensive computing systems Data-intensive computing systems University of Verona Computer Science Department Damiano Carra Acknowledgements q Credits Part of the course material is based on slides provided by the following authors

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

A Survey on Parallel Rough Set Based Knowledge Acquisition Using MapReduce from Big Data

A Survey on Parallel Rough Set Based Knowledge Acquisition Using MapReduce from Big Data A Survey on Parallel Rough Set Based Knowledge Acquisition Using MapReduce from Big Data Sachin Jadhav, Shubhangi Suryawanshi Abstract Nowadays, the volume of data is growing at an nprecedented rate, big

More information

ABSTRACT I. INTRODUCTION

ABSTRACT I. INTRODUCTION International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISS: 2456-3307 Hadoop Periodic Jobs Using Data Blocks to Achieve

More information

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1 MapReduce-II September 2013 Alberto Abelló & Oscar Romero 1 Knowledge objectives 1. Enumerate the different kind of processes in the MapReduce framework 2. Explain the information kept in the master 3.

More information

Pregel: A System for Large- Scale Graph Processing. Written by G. Malewicz et al. at SIGMOD 2010 Presented by Chris Bunch Tuesday, October 12, 2010

Pregel: A System for Large- Scale Graph Processing. Written by G. Malewicz et al. at SIGMOD 2010 Presented by Chris Bunch Tuesday, October 12, 2010 Pregel: A System for Large- Scale Graph Processing Written by G. Malewicz et al. at SIGMOD 2010 Presented by Chris Bunch Tuesday, October 12, 2010 1 Graphs are hard Poor locality of memory access Very

More information

Lightweight Streaming-based Runtime for Cloud Computing. Shrideep Pallickara. Community Grids Lab, Indiana University

Lightweight Streaming-based Runtime for Cloud Computing. Shrideep Pallickara. Community Grids Lab, Indiana University Lightweight Streaming-based Runtime for Cloud Computing granules Shrideep Pallickara Community Grids Lab, Indiana University A unique confluence of factors have driven the need for cloud computing DEMAND

More information

CHAPTER 4 ROUND ROBIN PARTITIONING

CHAPTER 4 ROUND ROBIN PARTITIONING 79 CHAPTER 4 ROUND ROBIN PARTITIONING 4.1 INTRODUCTION The Hadoop Distributed File System (HDFS) is constructed to store immensely colossal data sets accurately and to send those data sets at huge bandwidth

More information

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

Introduction to MapReduce Algorithms and Analysis

Introduction to MapReduce Algorithms and Analysis Introduction to MapReduce Algorithms and Analysis Jeff M. Phillips October 25, 2013 Trade-Offs Massive parallelism that is very easy to program. Cheaper than HPC style (uses top of the line everything)

More information

modern database systems lecture 10 : large-scale graph processing

modern database systems lecture 10 : large-scale graph processing modern database systems lecture 1 : large-scale graph processing Aristides Gionis spring 18 timeline today : homework is due march 6 : homework out april 5, 9-1 : final exam april : homework due graphs

More information

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21 Big Processing -Parallel Computation COS 418: Distributed Systems Lecture 21 Michael Freedman 2 Ex: Word count using partial aggregation Putting it together 1. Compute word counts from individual files

More information

PREGEL: A SYSTEM FOR LARGE-SCALE GRAPH PROCESSING

PREGEL: A SYSTEM FOR LARGE-SCALE GRAPH PROCESSING PREGEL: A SYSTEM FOR LARGE-SCALE GRAPH PROCESSING Grzegorz Malewicz, Matthew Austern, Aart Bik, James Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski (Google, Inc.) SIGMOD 2010 Presented by : Xiu

More information

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns STATS 700-002 Data Analysis using Python Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns Unit 3: parallel processing and big data The next few lectures will focus on big

More information

Apache Flink- A System for Batch and Realtime Stream Processing

Apache Flink- A System for Batch and Realtime Stream Processing Apache Flink- A System for Batch and Realtime Stream Processing Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich Prof Dr. Matthias Schubert 2016 Introduction to Apache Flink

More information

Map-Reduce. John Hughes

Map-Reduce. John Hughes Map-Reduce John Hughes The Problem 850TB in 2006 The Solution? Thousands of commodity computers networked together 1,000 computers 850GB each How to make them work together? Early Days Hundreds of ad-hoc

More information

The Future of High Performance Computing

The Future of High Performance Computing The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Data Analysis Using MapReduce in Hadoop Environment

Data Analysis Using MapReduce in Hadoop Environment Data Analysis Using MapReduce in Hadoop Environment Muhammad Khairul Rijal Muhammad*, Saiful Adli Ismail, Mohd Nazri Kama, Othman Mohd Yusop, Azri Azmi Advanced Informatics School (UTM AIS), Universiti

More information

Distributed Bottom up Approach for Data Anonymization using MapReduce framework on Cloud

Distributed Bottom up Approach for Data Anonymization using MapReduce framework on Cloud Distributed Bottom up Approach for Data Anonymization using MapReduce framework on Cloud R. H. Jadhav 1 P.E.S college of Engineering, Aurangabad, Maharashtra, India 1 rjadhav377@gmail.com ABSTRACT: Many

More information