Frameworks for Graph-Based Problems
Dakshil Shah, U.G. Student, Computer Engineering Department, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India
Chetashri Bhadane, Assistant Professor, Computer Engineering Department, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India
Aishwarya Sinh, U.G. Student, Computer Engineering Department, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India

ABSTRACT
There has been tremendous growth of graph-structured data in modern applications. Social networks and knowledge bases are two such applications, and they create a crucial need for architectures that can process dense graphs. MapReduce is a programming model used in large-scale data-parallel applications. Despite its notable role in big data analytics, MapReduce is not suited to large-scale graph processing. Because of these limitations, GraphLab is used as an alternative for processing large, dense graphs. In this paper, the PageRank algorithm is explained with respect to each implementation. The implementations are then compared and a conclusion is drawn based on the capabilities of each method in solving graph-based problems.

Keywords
Big Data, MapReduce, PageRank, GraphLab, Graph-Based Problems, Pregel

1. INTRODUCTION
Graphs are an abstract way of representing connectivity using nodes and links, also known as vertices and edges. Edges can be one-directional or bi-directional, and nodes and edges may carry auxiliary information. Many problems are formulated and solved in terms of graphs. Some graph-based problems are: shortest path problems, network flow problems, matching problems, the 2-SAT problem, the graph coloring problem, the Traveling Salesman Problem (TSP), and PageRank.

2. PAGERANK
The Web can be considered as a directed graph: the nodes are the pages, and there is an arc from page 1 to page 2 if there are one or more links from 1 to 2 [1]. PageRank is a probability distribution over the nodes in the graph representing the likelihood that a random walk over the link structure will arrive at a particular node. Nodes with high in-degrees tend to have a high PageRank.

P(n) = α(1/|G|) + (1 - α) Σ_{m ∈ L(n)} P(m)/C(m)

Here P(n) is the PageRank of a page n, |G| is the total number of pages in the graph, α is the random jump factor, L(n) is the set of pages that link to n, and C(m) is the out-degree of page m. PageRank is defined recursively, which leads to an iterative algorithm similar in structure to parallel breadth-first search. There are a large number of webpages in existence on the Internet today; Google has estimated the number of unique web URLs to be over a trillion [2]. Calculating the PageRank of this collection needs a Big Data approach.

3. MAPREDUCE
3.1 Introduction to MapReduce
MapReduce was introduced by Google; Apache has implemented its own version as part of Apache Hadoop. Clusters of thousands of commodity machines are used to process petabytes of data, which makes the model suitable for programs that can be decomposed into parallel tasks. It consists of an execution runtime and the Hadoop Distributed File System (HDFS). Partitioning the input data, scheduling the program's execution, managing inter-machine communication, and dealing with machine failures are handled by the execution runtime. There are two primary methods: map and reduce. A job is partitioned into multiple tasks, which are then parallelized for execution on a cluster. In the map stage, every task fetches data from HDFS and splits it into records using the Record Reader. This is then
processed by a user-defined function map(). The results are stored in temporary files containing many data partitions, which will be processed by the Reduce tasks. In the Reduce stage, every task goes through the following three phases [3]: 1) the data partitions are copied from remote Map tasks; 2) the data partitions are then sorted; 3) the sorted data partitions are further processed using the user-defined reduce() function. The results are then written back to disk. The user sets the maximum number of parallelized tasks in a configuration file. Slots can be considered as tokens, and only tasks that get a slot are allowed to run. This slot distribution is overseen by the scheduler, with the default scheduler being First In First Out (FIFO).

3.2 PageRank Implementation in MapReduce
The PageRank algorithm is simplified by ignoring the random jump factor and assuming no dangling nodes. Pseudo code for MapReduce:

1: class Mapper
2:   method Map(nid n, node N)
3:     p ← N.PageRank / |N.AdjacencyList|
4:     Emit(nid n, N)                        // pass along graph structure
5:     for all nodeid m ∈ N.AdjacencyList do
6:       Emit(nid m, p)                      // pass PageRank mass to neighbours

1: class Reducer
2:   method Reduce(nid m, [p1, p2, ...])
3:     M ← ∅; s ← 0
4:     for all p ∈ [p1, p2, ...] do
5:       if IsNode(p) then
6:         M ← p                             // recover graph structure
7:       else
8:         s ← s + p                         // sum incoming PageRank contributions
9:     M.PageRank ← s
10:    Emit(nid m, node M)

In the map phase, each node's PageRank mass is divided evenly and each piece is passed along the outgoing edges to its neighbours. In the reduce phase, the PageRank contributions are summed up at each destination node. Each MapReduce job corresponds to one iteration of the above algorithm.

Figure 1: Sample graph [4]
Figure 2: After one iteration on MapReduce [4]

The size of each box in Figure 2 is in proportion to its PageRank value. The PageRank mass is distributed evenly over each node's adjacency list. In the reduce phase, all partial PageRank contributions are summed together to get the updated values.
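The map and reduce steps above can be sketched as a toy, single-process Python program. The graph and the helper names here are illustrative; a real job would run these functions over partitioned HDFS data.

```python
from collections import defaultdict

def map_phase(graph, ranks):
    """Emit (n, adjacency list) to pass the structure along, and
    (m, mass) to distribute n's PageRank evenly over its out-links."""
    emitted = []
    for n, adj in graph.items():
        emitted.append((n, ("STRUCT", adj)))   # pass along graph structure
        p = ranks[n] / len(adj)                # divide the mass evenly
        for m in adj:
            emitted.append((m, ("MASS", p)))   # pass mass to neighbour m
    return emitted

def reduce_phase(emitted):
    """Shuffle: group by destination node, then sum the incoming mass."""
    grouped = defaultdict(list)
    for key, value in emitted:
        grouped[key].append(value)
    return {m: sum(v for tag, v in values if tag == "MASS")
            for m, values in grouped.items()}

# One iteration on a three-page toy graph with uniform initial mass.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {"A": 1/3, "B": 1/3, "C": 1/3}
ranks = reduce_phase(map_phase(graph, ranks))
# After one iteration: A holds C's mass (1/3), B holds half of A's (1/6),
# and C holds B's plus half of A's (1/2).
```

Running one more "job" on the result corresponds to a second iteration, since each MapReduce job performs exactly one round of the update.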
The graph structure seen in Figure 1 is passed along in every iteration.

3.3 Issues
The MapReduce model does not provide any built-in mechanism for communicating global state. Because the representation is an adjacency matrix, inter-node communication can occur only via direct links or intermediate nodes; information can be passed only within the local graph structure. Local computation is carried out at each node and passed on to the neighbouring nodes, and convergence on the global graph is possible only after multiple iterations. The shuffling and sorting steps of MapReduce carry out the passing of these partial results. The amount of intermediate data generated is on the order of the number of edges. For a dense graph, run time would be wasted copying this intermediate data across the network, which may lead to a run time of O(n²) in the worst case, where n is the number of nodes. Thus, MapReduce algorithms are not feasible on large, dense graphs. Due to the large number of graph-based applications, such as in social networks, a more efficient approach is needed.
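The growth of intermediate data can be read directly off the map phase above: one structure record per node plus one mass record per edge. A quick count with toy numbers (illustrative, not a benchmark) shows why a dense graph is expensive to shuffle:

```python
def intermediate_records(graph):
    """Records emitted by one PageRank map phase: one structure record
    per node plus one mass record per out-edge."""
    return len(graph) + sum(len(adj) for adj in graph.values())

# A dense graph on n nodes has roughly n^2 edges, so the shuffle volume
# grows quadratically in n.
n = 1000
dense = {v: [w for w in range(n) if w != v] for v in range(n)}
print(intermediate_records(dense))  # 1000 + 999000 = 1,000,000 records
```

All of these records must be sorted and copied across the network between the map and reduce stages, which is the O(n²) cost discussed above.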
4. GRAPHLAB
4.1 Introduction to GraphLab
MapReduce is unable to express statistical inference algorithms efficiently. GraphLab compactly expresses asynchronous algorithms with sparse computational dependencies while ensuring data consistency and a high degree of parallelism [5]. GraphLab can express complex computational dependencies using a data graph, and it can express iterative parallel algorithms with dynamic scheduling. The data model consists of two parts: a directed data graph and a shared data table. The graph is represented as G = (V, E). The data associated with a vertex v is denoted by D(v), and the data associated with an edge (u, v) by D(u→v). The shared data table (SDT) is an associative map, T[Key] → Value, between keys and arbitrary blocks of data. Computation in GraphLab is performed through an update function, which defines the local computation, or through the sync mechanism, which defines global aggregation. The update function is analogous to the Map in MapReduce, but unlike the map function, update functions are permitted to access and modify overlapping contexts in the graph. The sync mechanism is analogous to the Reduce operation but, unlike in MapReduce, runs concurrently with the update functions. The GraphLab runtime engine determines the best order in which to run vertices by relaxing the execution-order requirements of the shared memory, thus enabling efficient distributed execution; the only restriction imposed is that all vertices must eventually be run. GraphLab eliminates messages and isolates the user-defined algorithm from data movement, which lets the system choose how and when to move program state. It enables the algorithm designer to distinguish between data that is shared with all neighbours and data that is shared with a particular neighbour, by allowing mutable data to be associated with both vertices and edges. GraphLab does not differentiate between edge directions.
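The data-graph-plus-scheduler model can be sketched as a toy, single-threaded Python program: a graph carrying vertex data D(v) and edge data D(u→v), and a loop that keeps running user update functions until the task set is empty. The class and function names, the threshold, and the sequential queue are assumptions for illustration; real GraphLab runs these tasks in parallel under a consistency model.

```python
from collections import deque

ALPHA, EPS = 0.15, 1e-4   # random jump factor and adaptive threshold (assumed)

class DataGraph:
    def __init__(self, vertex_data, edge_data, in_nbrs):
        self.vertex_data = vertex_data   # v -> R(v)
        self.edge_data = edge_data       # (u, v) -> w(u, v)
        self.in_nbrs = in_nbrs           # v -> set of u with an edge u -> v

def pagerank_update(g, v):
    """Update function: recompute R(v) from its in-neighbours and, only if
    the change exceeds the threshold, schedule the dependent vertices."""
    old = g.vertex_data[v]
    rank = ALPHA / len(g.vertex_data)
    for u in g.in_nbrs.get(v, ()):
        rank += (1 - ALPHA) * g.edge_data[(u, v)] * g.vertex_data[u]
    g.vertex_data[v] = rank
    if abs(rank - old) > EPS:            # adaptive: only wake dependents
        return [(pagerank_update, w) for w, srcs in g.in_nbrs.items()
                if v in srcs]
    return []

def run(g, tasks):
    """Sequential stand-in for the GraphLab execution model."""
    queue = deque(tasks)
    while queue:
        f, v = queue.popleft()
        queue.extend(f(g, v))

# Three pages in a cycle A -> B -> C -> A, unit link weights.
g = DataGraph({"A": 1.0, "B": 0.0, "C": 0.0},
              {("A", "B"): 1.0, ("B", "C"): 1.0, ("C", "A"): 1.0},
              {"A": {"C"}, "B": {"A"}, "C": {"B"}})
run(g, [(pagerank_update, v) for v in "ABC"])
# For the symmetric cycle, every rank converges near 1/3.
```

Note how termination is driven by the task set becoming empty rather than by a fixed iteration count: this is the dynamic scheduling that MapReduce cannot express.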
The asynchronous execution behaviour depends on the number of machines and the availability of network resources. This leads to nondeterminism that can complicate algorithm design and debugging. The sequential model of the GraphLab abstraction is translated into parallel execution by allowing multiple processors to run the same loop on the same graph, removing and running different vertices simultaneously. To retain the semantics of sequential execution, overlapping computations are not run simultaneously. GraphLab enforces serializability, so that every parallel execution of vertex-oriented programs has a corresponding sequential execution. To do this, it prevents adjacent vertex programs from running concurrently, using a fine-grained locking protocol that sequentially grabs locks on all neighbouring vertices [6].

4.2 PageRank Implementation in GraphLab
1) Data graph
Every vertex v is related to a webpage, and every edge (u, v) is related to a link u→v. The vertex data D(v) keeps the rank of the webpage, R(v); the edge data D(u→v) keeps the weight of the link u→v.

2) The update function
The update function for PageRank assigns the computed weighted sum of the current ranks of the neighbouring vertices as the rank of the current vertex. Only if the value of the current vertex changes by more than a predefined threshold are the neighbours listed for update, making the algorithm adaptive. The update function takes as inputs a vertex v and its scope Sv. It returns the new version of the scope and a set of tasks T that encodes future task executions:

Update: (v, Sv) → (Sv, T)

Algorithm: PageRank update function
Input: vertex data R(v) from Sv
Input: edge data {w(u,v) : u ∈ N[v]} from Sv
Input: neighbour vertex data {R(u) : u ∈ N[v]} from Sv
  Rold(v) ← R(v)                          // save the old PageRank
  R(v) ← α/n
  for each u ∈ N[v] do                    // loop over neighbours
    R(v) ← R(v) + (1 - α) · w(u,v) · R(u)
  if |R(v) - Rold(v)| > ε then            // if the PageRank changed sufficiently,
    return {(PageRankFun, u) : u ∈ N[v]}  // schedule the neighbours for update
Output: modified scope Sv with new R(v)

3) The sync operation
The sync operation is defined as a tuple

(Key, Fold, Merge, Finalize, acc(0), τ)

consisting of a unique key, three user-defined functions, an initial accumulator value, and an integer representing the interval between sync operations. The Fold and Merge functions are used by the sync operation to perform a global synchronous reduce: Fold aggregates vertex data, and the intermediate Fold results are combined by Merge. A transformation on the final value is performed by the Finalize function, and the result is stored. The Key is used by update functions to access the most recent result of the sync operation.

Algorithm: GraphLab execution model
Input: data graph G = (V, E, D)
Input: initial task set T = {(f, v1), (g, v2), ...}
Input: initial set of syncs (Name, Fold, Merge, Finalize, acc(0), τ)
  while T is not empty do
    (f, v) ← RemoveNext(T)
    (T', Sv) ← f(v, Sv)
    T ← T ∪ T'
    run all sync operations that are ready
Output: modified data graph G = (V, E, D') and the results of the sync operations

5. PREGEL
5.1 Introduction to Pregel
Pregel [7] is a scalable infrastructure developed by Google to mine data contained in a variety of graphs. In order to solve graph-based problems, programs are developed as a sequence of iterations. In every iteration, a vertex can receive messages sent to it in the previous iteration and send messages to other vertices, independently of other vertices. A vertex may also modify its edge states and alter the graph's topology. Pregel programs can be scaled automatically on a cluster [8]. Vertices carry out the graph computation, while the edges are responsible for communicating the computed results between vertices; edges do not participate in computation [9].
The input is a directed graph. The input is initialized, followed by a sequence of supersteps, and the computation terminates at the end of the algorithm. In each superstep, the same user-defined function for the given algorithm executes in parallel, with the computation being done in the vertices. Termination of the algorithm depends upon every vertex voting to halt [8]. When a vertex votes to halt, it deactivates itself, and the framework will not execute this vertex in subsequent supersteps unless it is reactivated by a message. When all the vertices are in an inactive state simultaneously, the algorithm terminates.

5.2 PageRank Implementation in Pregel
Algorithm [10]:

class PageRank
  function ComputeAtVertex(Message msg)
    if (superstep() >= 1)
      sum = 0
      while (!msg->done())
        sum += msg->value()
        msg->next()
      mutablevalue() = 0.15 / numvertices() + dampingfactor * sum
    if (superstep() > maxset)
      VoteToHalt()
    else
      n = getoutedgeiterator().size()
      SendMessageToAllNeighbours(GetValue() / n)

The vertices store the intermediate PageRank value that is computed. Once the maximum number of supersteps is crossed, no further messages are sent and the vertices vote to halt. The algorithm is generally run until convergence is achieved.
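The superstep and vote-to-halt behaviour above can be sketched as a toy, single-process Python simulation. Real Pregel is a distributed C++ API; the function name and the fixed superstep budget here are assumptions for illustration.

```python
def pregel_pagerank(out_edges, max_supersteps=30, damping=0.85):
    """out_edges: vertex -> list of out-neighbours (no dangling nodes)."""
    n = len(out_edges)
    value = {v: 1.0 / n for v in out_edges}     # initial PageRank
    inbox = {v: [] for v in out_edges}
    active = set(out_edges)
    superstep = 0
    while active:                               # all vertices halted => stop
        outbox = {v: [] for v in out_edges}
        for v in list(active):
            if superstep >= 1:                  # combine last step's messages
                value[v] = (1 - damping) / n + damping * sum(inbox[v])
            if superstep >= max_supersteps:
                active.discard(v)               # vote to halt: send nothing
            else:
                share = value[v] / len(out_edges[v])
                for w in out_edges[v]:
                    outbox[w].append(share)     # delivered next superstep
        inbox = outbox                          # superstep barrier
        superstep += 1
    return value

ranks = pregel_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
# The ranks sum to 1; C, which is linked from both A and B, ends highest.
```

The barrier between supersteps (swapping `outbox` into `inbox`) is what makes the model synchronous: a message sent in superstep s is only visible in superstep s + 1.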
6. COMPARISON
Most advanced machine learning and data mining algorithms focus on modelling dependencies between data, but data-parallel abstractions like MapReduce fail when there are computational dependencies in the data. For expressing such dependencies, graph-parallel abstractions like GraphLab and Pregel simplify the design and implementation of graph-parallel algorithms. This is achieved by freeing the user to focus on sequential computation rather than the parallel movement of data. While the MapReduce abstraction can be run iteratively, no mechanism is provided to directly carry out iterative computation. As a result, it is not possible to express sophisticated scheduling, automatically assess termination, or leverage basic data persistence. Unlike MapReduce, GraphLab provides sophisticated scheduling primitives that can express iterative parallel algorithms with dynamic scheduling.

Table 1. Comparison

Feature                     | MapReduce                    | GraphLab           | Pregel
----------------------------+------------------------------+--------------------+-------------------
Messages                    | Yes                          | Eliminated         | Yes
File System                 | Hadoop Distributed           | Shared Data Table  | Google File System
                            | File System                  |                    |
Computational Model         | Synchronous                  | Asynchronous       | Synchronous
Sparse dependency           | No                           | Yes                | Yes
Iterative                   | With help of extensions [5]  | Yes                | Yes
Programming model           | Shared-memory                | Shared-memory      | Message-passing
Parallelism                 | Data-Parallel                | Graph-Parallel     | Graph-Parallel
Architecture model          | Master-Slave                 | Peer-to-Peer       | Master-Slave
Task/Vertex Scheduling model| Pull-based                   | Push-based         | Push-based
Application Suitability     | Loosely-Connected            | Strongly-Connected | Strongly-Connected
                            | Applications                 | Applications       | Applications

GraphLab gives faster and more efficient runtime performance than MapReduce and eliminates messages. The Update Function in GraphLab is analogous to the Map in MapReduce, but unlike in MapReduce, update functions are permitted to access and modify overlapping contexts in the graph.
The sync mechanism is analogous to the Reduce operation but, unlike in MapReduce, runs concurrently with the update functions [5]. Pregel is more effective at iterative processing than MapReduce [10]: Pregel keeps vertices and edges on the machine that performs the computation and uses the network to transfer only messages, whereas MapReduce passes the entire state of the graph from one stage to the next and needs to coordinate the steps of a chained MapReduce.

7. CONCLUSION
Both MapReduce and Pregel are synchronous computation models, while GraphLab is asynchronous. GraphLab and Pregel both support iterative algorithms and are graph-parallel; MapReduce supports iterative algorithms only with the help of extensions and is data-parallel. In this paper we reviewed three frameworks with respect to graph-based problems by comparing their mechanisms for solving them, with GraphLab and Pregel more suited to graph-based problems than MapReduce.

8. REFERENCES
[1] A. Rajaraman and J. Ullman, Mining of Massive Datasets. New York, N.Y.: Cambridge University Press.
[2] J. Alpert and N. Hajaj, "We knew the web was big...", Google Blog. [Online]. [Accessed: 08-Sep-2015].
[3] X. Dong, Y. Wang, and H. Liao, "Scheduling mixed real-time and non-real-time applications in MapReduce environment," in Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on. IEEE.
[4] J. Lin and C. Dyer, "Data-intensive text processing with MapReduce," Synthesis Lectures on Human Language Technologies 3.1 (2010).
[5] S. Sakr, "Processing large-scale graph data: A guide to current technology," ibm.com. [Online].
[Accessed: 16-Sep-2015].
[6] B. Hindman et al., "Nexus: A common substrate for cluster computing," Workshop on Hot Topics in Cloud Computing.
[7] Y. Low et al., "GraphLab: A new framework for parallel machine learning," arXiv preprint (2014).
[8] G. Czajkowski, "Large-scale graph computing at Google," Google Research Blog. [Online]. Available: research.blogspot.in/2009/06/large-scale-graph-computing-atgoogle.html. [Accessed: 15-Sep-2015].
[9] G. Malewicz et al., "Pregel: a system for large-scale graph processing," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010.
[10] M. Han et al., "An experimental comparison of Pregel-like graph processing systems," Proceedings of the VLDB Endowment 7.12 (2014).
[11] D. Jiang et al., "epiC: an extensible and scalable system for processing big data," Proceedings of the VLDB Endowment 7.7 (2014).
[12] "The GraphLab Abstraction".
More informationApache Giraph: Facebook-scale graph processing infrastructure. 3/31/2014 Avery Ching, Facebook GDM
Apache Giraph: Facebook-scale graph processing infrastructure 3/31/2014 Avery Ching, Facebook GDM Motivation Apache Giraph Inspired by Google s Pregel but runs on Hadoop Think like a vertex Maximum value
More informationOne Trillion Edges. Graph processing at Facebook scale
One Trillion Edges Graph processing at Facebook scale Introduction Platform improvements Compute model extensions Experimental results Operational experience How Facebook improved Apache Giraph Facebook's
More informationProgramming Models MapReduce
Programming Models MapReduce Majd Sakr, Garth Gibson, Greg Ganger, Raja Sambasivan 15-719/18-847b Advanced Cloud Computing Fall 2013 Sep 23, 2013 1 MapReduce In a Nutshell MapReduce incorporates two phases
More informationLink Analysis in the Cloud
Cloud Computing Link Analysis in the Cloud Dell Zhang Birkbeck, University of London 2017/18 Graph Problems & Representations What is a Graph? G = (V,E), where V represents the set of vertices (nodes)
More informationFREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN. School of Computing, SASTRA University, Thanjavur , India
Volume 115 No. 7 2017, 105-110 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN Balaji.N 1,
More informationarxiv: v1 [cs.db] 26 Apr 2012
Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud Yucheng Low Carnegie Mellon University ylow@cs.cmu.edu Joseph Gonzalez Carnegie Mellon University jegonzal@cs.cmu.edu
More informationGraph-Parallel Problems. ML in the Context of Parallel Architectures
Case Study 4: Collaborative Filtering Graph-Parallel Problems Synchronous v. Asynchronous Computation Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 20 th, 2014
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 14: Distributed Graph Processing Motivation Many applications require graph processing E.g., PageRank Some graph data sets are very large
More informationDistributed computing: index building and use
Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput
More informationMap-Reduce and Adwords Problem
Map-Reduce and Adwords Problem Map-Reduce and Adwords Problem Miłosz Kadziński Institute of Computing Science Poznan University of Technology, Poland www.cs.put.poznan.pl/mkadzinski/wpi Big Data (1) Big
More informationDeveloping MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
More informationGiraph Unchained: Barrierless Asynchronous Parallel Execution in Pregel-like Graph Processing Systems
Giraph Unchained: Barrierless Asynchronous Parallel Execution in Pregel-like Graph Processing Systems ABSTRACT Minyang Han David R. Cheriton School of Computer Science University of Waterloo m25han@uwaterloo.ca
More informationHadoop Map Reduce 10/17/2018 1
Hadoop Map Reduce 10/17/2018 1 MapReduce 2-in-1 A programming paradigm A query execution engine A kind of functional programming We focus on the MapReduce execution engine of Hadoop through YARN 10/17/2018
More informationTITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP
TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop
More informationImplementation of Aggregation of Map and Reduce Function for Performance Improvisation
2016 IJSRSET Volume 2 Issue 5 Print ISSN: 2395-1990 Online ISSN : 2394-4099 Themed Section: Engineering and Technology Implementation of Aggregation of Map and Reduce Function for Performance Improvisation
More informationMitigating Data Skew Using Map Reduce Application
Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,
More informationInternational Journal of Advance Engineering and Research Development. A Study: Hadoop Framework
Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja
More informationCS-2510 COMPUTER OPERATING SYSTEMS
CS-2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application Functional
More informationDistributed computing: index building and use
Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput
More informationA Parallel Community Detection Algorithm for Big Social Networks
A Parallel Community Detection Algorithm for Big Social Networks Yathrib AlQahtani College of Computer and Information Sciences King Saud University Collage of Computing and Informatics Saudi Electronic
More informationHadoop/MapReduce Computing Paradigm
Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications
More informationMassive Online Analysis - Storm,Spark
Massive Online Analysis - Storm,Spark presentation by R. Kishore Kumar Research Scholar Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Kharagpur-721302, India (R
More informationMemory-Optimized Distributed Graph Processing. through Novel Compression Techniques
Memory-Optimized Distributed Graph Processing through Novel Compression Techniques Katia Papakonstantinopoulou Joint work with Panagiotis Liakos and Alex Delis University of Athens Athens Colloquium in
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationClash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics
Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Presented by: Dishant Mittal Authors: Juwei Shi, Yunjie Qiu, Umar Firooq Minhas, Lemei Jiao, Chen Wang, Berthold Reinwald and Fatma
More informationWhere We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344
Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)
More informationDept. Of Computer Science, Colorado State University
CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,
More informationCLUSTERING BIG DATA USING NORMALIZATION BASED k-means ALGORITHM
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,
More informationClustering Lecture 8: MapReduce
Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data
More informationAn Improved Performance Evaluation on Large-Scale Data using MapReduce Technique
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,
More informationData-Intensive Distributed Computing
Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 8: Analyzing Graphs, Redux (1/2) March 20, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More informationData Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros
Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on
More informationSQL-to-MapReduce Translation for Efficient OLAP Query Processing
, pp.61-70 http://dx.doi.org/10.14257/ijdta.2017.10.6.05 SQL-to-MapReduce Translation for Efficient OLAP Query Processing with MapReduce Hyeon Gyu Kim Department of Computer Engineering, Sahmyook University,
More informationDistributed Computations MapReduce. adapted from Jeff Dean s slides
Distributed Computations MapReduce adapted from Jeff Dean s slides What we ve learnt so far Basic distributed systems concepts Consistency (sequential, eventual) Fault tolerance (recoverability, availability)
More informationA Comparative study of Clustering Algorithms using MapReduce in Hadoop
A Comparative study of Clustering Algorithms using MapReduce in Hadoop Dweepna Garg 1, Khushboo Trivedi 2, B.B.Panchal 3 1 Department of Computer Science and Engineering, Parul Institute of Engineering
More informationSTATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns
STATS 700-002 Data Analysis using Python Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns Unit 3: parallel processing and big data The next few lectures will focus on big
More informationIntroduction To Graphs and Networks. Fall 2013 Carola Wenk
Introduction To Graphs and Networks Fall 203 Carola Wenk On the Internet, links are essentially weighted by factors such as transit time, or cost. The goal is to find the shortest path from one node to
More informationDFA-G: A Unified Programming Model for Vertex-centric Parallel Graph Processing
SCHOOL OF COMPUTER SCIENCE AND ENGINEERING DFA-G: A Unified Programming Model for Vertex-centric Parallel Graph Processing Bo Suo, Jing Su, Qun Chen, Zhanhuai Li, Wei Pan 2016-08-19 1 ABSTRACT Many systems
More informationThe amount of data increases every day Some numbers ( 2012):
1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect
More informationOutline. Graphs. Divide and Conquer.
GRAPHS COMP 321 McGill University These slides are mainly compiled from the following resources. - Professor Jaehyun Park slides CS 97SI - Top-coder tutorials. - Programming Challenges books. Outline Graphs.
More information2/26/2017. The amount of data increases every day Some numbers ( 2012):
The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to
More informationPLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
More informationUniversity of Maryland. Tuesday, March 2, 2010
Data-Intensive Information Processing Applications Session #5 Graph Algorithms Jimmy Lin University of Maryland Tuesday, March 2, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
More informationGraph Algorithms. Revised based on the slides by Ruoming Kent State
Graph Algorithms Adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/
More informationECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective
ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Piccolo: Building Fast, Distributed Programs
More information