Frameworks for Graph-Based Problems

Similar documents
Graph Processing. Connor Gramazio Spiros Boosalis

PREGEL: A SYSTEM FOR LARGE- SCALE GRAPH PROCESSING

Pregel: A System for Large-Scale Graph Proces sing

Authors: Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, L., Leiser, N., Czjkowski, G.

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

MapReduce Design Patterns

Graph Algorithms using Map-Reduce. Graphs are ubiquitous in modern society. Some examples: The hyperlink structure of the web

Pregel. Ali Shah

modern database systems lecture 10 : large-scale graph processing

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

PREGEL: A SYSTEM FOR LARGE-SCALE GRAPH PROCESSING

Big Graph Processing. Fenggang Wu Nov. 6, 2016

Big Data Analytics. Lucas Rego Drumond

[CoolName++]: A Graph Processing Framework for Charm++

Large Scale Graph Processing Pregel, GraphLab and GraphX

Distributed Systems. 21. Graph Computing Frameworks. Paul Krzyzanowski. Rutgers University. Fall 2016

PREGEL AND GIRAPH. Why Pregel? Processing large graph problems is challenging Options

Data Partitioning and MapReduce

Pregel: A System for Large- Scale Graph Processing. Written by G. Malewicz et al. at SIGMOD 2010 Presented by Chris Bunch Tuesday, October 12, 2010

Large-Scale Graph Processing 1: Pregel & Apache Hama Shiow-yang Wu ( 吳秀陽 ) CSIE, NDHU, Taiwan, ROC

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21

COSC 6339 Big Data Analytics. Graph Algorithms and Apache Giraph

Automatic Scaling Iterative Computations. Aug. 7 th, 2012

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

TI2736-B Big Data Processing. Claudia Hauff

Cloud Computing CS

Global Journal of Engineering Science and Research Management

King Abdullah University of Science and Technology. CS348: Cloud Computing. Large-Scale Graph Processing

I ++ Mapreduce: Incremental Mapreduce for Mining the Big Data

Lecture 11: Graph algorithms! Claudia Hauff (Web Information Systems)!

Database Applications (15-415)

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

1. Introduction to MapReduce

Introduction to MapReduce

Graph Algorithms. Chapter 5

High Performance Computing on MapReduce Programming Framework

Giraph: Large-scale graph processing infrastructure on Hadoop. Qu Zhi

iihadoop: an asynchronous distributed framework for incremental iterative computations

HaLoop Efficient Iterative Data Processing on Large Clusters

CS /21/2016. Paul Krzyzanowski 1. Can we make MapReduce easier? Distributed Systems. Apache Pig. Apache Pig. Pig: Loading Data.

Graphs (Part II) Shannon Quinn

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기

Databases 2 (VU) ( / )

Survey Paper on Traditional Hadoop and Pipelined Map Reduce

GraphHP: A Hybrid Platform for Iterative Graph Processing

Harp-DAAL for High Performance Big Data Computing

MapReduce: Algorithm Design for Relational Operations

CS 347 Parallel and Distributed Data Processing

Distributed Computation Models

CS 5220: Parallel Graph Algorithms. David Bindel

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011

Survey on MapReduce Scheduling Algorithms

Data-Intensive Computing with MapReduce

Implementation of Parallel CASINO Algorithm Based on MapReduce. Li Zhang a, Yijie Shi b

Apache Giraph: Facebook-scale graph processing infrastructure. 3/31/2014 Avery Ching, Facebook GDM

One Trillion Edges. Graph processing at Facebook scale

Programming Models MapReduce

Link Analysis in the Cloud

FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN. School of Computing, SASTRA University, Thanjavur , India

arxiv: v1 [cs.db] 26 Apr 2012

Graph-Parallel Problems. ML in the Context of Parallel Architectures

CS 347 Parallel and Distributed Data Processing

Distributed computing: index building and use

Map-Reduce and Adwords Problem

Developing MapReduce Programs

Giraph Unchained: Barrierless Asynchronous Parallel Execution in Pregel-like Graph Processing Systems

Hadoop Map Reduce 10/17/2018 1

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

Implementation of Aggregation of Map and Reduce Function for Performance Improvisation

Mitigating Data Skew Using Map Reduce Application

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

CS-2510 COMPUTER OPERATING SYSTEMS

Distributed computing: index building and use

A Parallel Community Detection Algorithm for Big Social Networks

Hadoop/MapReduce Computing Paradigm

Massive Online Analysis - Storm,Spark

Memory-Optimized Distributed Graph Processing. through Novel Compression Techniques

Introduction to MapReduce

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Dept. Of Computer Science, Colorado State University

CLUSTERING BIG DATA USING NORMALIZATION BASED k-means ALGORITHM

Clustering Lecture 8: MapReduce

An Improved Performance Evaluation on Large-Scale Data using MapReduce Technique

Data-Intensive Distributed Computing

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

SQL-to-MapReduce Translation for Efficient OLAP Query Processing

Distributed Computations MapReduce. adapted from Jeff Dean s slides

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns

Introduction To Graphs and Networks. Fall 2013 Carola Wenk

DFA-G: A Unified Programming Model for Vertex-centric Parallel Graph Processing

The amount of data increases every day Some numbers ( 2012):

Outline. Graphs. Divide and Conquer.

2/26/2017. The amount of data increases every day Some numbers ( 2012):

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

University of Maryland. Tuesday, March 2, 2010

Graph Algorithms. Revised based on the slides by Ruoming Kent State

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

Transcription:

Frameworks for Graph-Based Problems Dakshil Shah U.G. Student Computer Engineering Department Dwarkadas J. Sanghvi College of Engineering, Mumbai, India Chetashri Bhadane Assistant Professor Computer Engineering Department Dwarkadas J. Sanghvi College of Engineering, Mumbai, India Aishwarya Sinh U.G. Student Computer Engineering Department Dwarkadas J. Sanghvi College of Engineering, Mumbai, India ABSTRACT There has been a tremendous growth of graph-structured data in modern applications. Social networks and knowledge bases are some such applications, which create a crucial need for architectures that can process dense graphs. MapReduce is a programming model used in large-scale data parallel applications. Despite its notable role in big data analytics, MapReduce is not suited for large scale graph processing. Due to these limitations of MapReduce, GraphLab is used as an alternative for large dense graph processing applications. In this paper, PageRank algorithm is explained with respect to both the implementations. The same are then compared and a conclusion is drawn based on the capabilities of both the methods in solving graph-based problems. Keywords Big Data, MapReduce, PageRank, GraphLab, Graph Based Problems, Pregel walk over the link structure will arrive at a particular node. Nodes with high in-degrees tend to have a high PageRank. P is the PageRank of a page n where G is the total number of pages in the graph, α is the random jump factor, L (n) is the set of pages that link to n, and C (m) is the out-degree of page m. PageRank is defined recursively. This leads to an iterative algorithm, which is similar in structure to a parallel breadth first search. There are a large number of webpages in existence today on the Internet. Google has estimated the number of unique web URLs to be over a trillion [2]. Calculating the PageRank of this collection needs a Big Data approach. 1. INTRODUCTION Graphs are an abstract way of representing connectivity using nodes and links, also known as vertices and edges. Edges can be one-directional or bi-directional. Nodes and edges may have some auxiliary information. Many problems are formulated and solved in terms of graphs. Some graph based problems are: Shortest Path problems, Network flow problems, Matching problems, 2- SAT problem, Graph coloring problem, Traveling Salesman Problem (TSP), PageRank. 2. PAGERANK The Web can be considered as a directed graph. The nodes are the pages and there is an arc from page 1 to page 2 if there are one or more links from 1 to 2. [1] PageRank is a probability distribution over nodes in the graph representing the likelihood that a random 47 Dakshil Shah, Chetashri Bhadane, Aishwarya Sinh 3. MAPREDUCE 3.1 Introduction to MapReduce Google introduced MapReduce. Apache has implemented its own version as part of Apache Hadoop. Thousands of commodity machines are used to process petabytes of data using clusters. This makes it suitable for programs that can be decomposed into parallel tasks. It consists of an execution runtime and Hadoop Distributed File System (HDFS). The partitioning of input data, scheduling the program s execution, managing inter-machine communication and dealing with machine failures is handled by the execution runtime. There are two primary methods: map and reduce. A job is partitioned into multiple tasks, which are then parallelized for execution on a cluster. In the map stage, every task fetches data from HDFS and splits it into records using the Record Reader. This is then

processed by a user defined function map (). The results are stored into temporary files, which contain many data partitions, which will be processed by the Reduce task. In the Reduce stage, every task goes through the following 3 phases [3]: 1) The data partitions are copied from remote Map tasks. 2) The data partitions are then sorted. 3) The sorted data partitions are further processed using user defined reduce () function. The results are then written back to the disk. The user sets in a configuration file the maximum number of parallelized tasks. Slots can be considered as tokens and only tasks that get a slot are allowed to run. This slot distribution is overseen by the scheduler, with the default scheduler being First in First out (FIFO). 3.2 PageRank Implementation in MapReduce PageRank algorithm is simplified by ignoring random jump factor and assuming no dangling nodes. Pseudo code of MapReduce: 1: class Mapper 2: method Map(nid n; node N) 3: p N:PageRank/ N.AdjacencyList 4: Emit(nid n, N) Pass along graph structure 5: for all nodeid m N.AdjacencyList do 6: Emit(nid m, p) Pass PageRank mass to neighbours 1: class Reducer 2: method Reduce(nid m, [p1, p2, ]) 3: M 4: for all p counts [p1, p2, ] do 5: if IsNode(p) then 6: M p Recover graph structure 7: else 8: s s + p Sum incoming PageRank contributions 9: M.PageRank s 10: Emit(nid m, node M) In the map phase, we evenly divide each node s PageRank mass and pass each piece along outgoing edges to neighbours. In the reduce phase PageRank contributions are summed up at each destination node. Each MapReduce job corresponds to one iteration of the above algorithm. Figure 1: Sample graph [4] Figure 2: After one iteration on MapReduce [4] The size of each box in Figure 2 is in proportion to its PageRank value. The PageRank mass is evenly distributed to each nodes adjacency list. In the reduce phase, all partial PageRank contributions are then summed together to get the updated values. The graph structure seen in Figure 1 is passed recursively. 3.3 Issues The MapReduce model does not provide any built in mechanism for communicating global state. Due to the representation being an adjacency matrix, inter node communication can occur only via direct links or intermediate nodes. Information can be passed only within the local graph structure. The local computation is carried out at each node and passed on to the neighbouring nodes. Convergence on the global graph is possible after multiple iterations. The shuffling and sorting steps of MapReduce carry out the passing of these partial results. The amount of intermediate data generated is dependent on the order of the number of edges. For a dense graph, run time would be wasted in copying this intermediate data across the network, which may lead to a run time of O (n 2 ) in the worst case where n is number of nodes. Thus, MapReduce algorithms are not feasible on large, dense graphs. Due to the large number of graph-based applications such as in social networks, a more efficient approach is needed. 48 Dakshil Shah, Chetashri Bhadane, Aishwarya Sinh

4. GRAPHLAB 4.1 Introduction to GraphLab MaReduce is unable to express statistical inference algorithms efficiently. Graphlab compactly expresses asynchronous algorithms with sparse computational dependencies. It ensures data consistency and a high degree of parallelism.[5] GraphLab can express complex computational dependencies using a data graph. It can express iterative parallel algorithms with dynamic scheduling. The data model consists of 2 parts: A directed data graph and a shared data table. The graph can be represented as G= (V, E). The data associated with vertex v is denoted as D v and data associated with edge ( by D (u v) The shared data table (SDT) is an associative map, T[Key] Value, between keys and arbitrary blocks of data. Computation in GraphLab is performed through an update function, which defines the local computation, or through the sync mechanism, that defines global aggregation. The Update Function is analogous to the Map in MapReduce. Update functions are permitted to access and modify overlapping contexts in the graph unlike the map function. The sync mechanism is analogous to the Reduce operation and runs concurrently with the update functions unlike in MapReduce. The GraphLab runtime engine determines the best order to run vertices in by relaxing the execution order requirements of the shared memory, thus enabling efficient distributed execution. However, the restriction that all vertices must be eventually run is imposed. GraphLab eliminates messages and isolates the user-defined algorithm from data movement. Due to this, the system can choose how and when to move program state. It enables the algorithm designer to distinguish between data that is shared with all neighbours and data that is shared with a particular neighbour. This is done by allowing mutable data to be associated with both vertices and edges. GraphLab does not differentiate between edge directions. The asynchronous execution behaviour depends on the number of machines and the availability of network resources. This leads to nondeterminism that can complicate algorithm design and debugging. The sequential model of the GraphLab abstraction is translated into parallel execution by allowing multiple processors to run the same loop on the same graph, removing and running different vertices simultaneously. It ensures that overlapping computation is not run simultaneously, so as to retain the semantics of sequential execution. Graphlab enforces serializabilty to overcome the above, so that every parallel execution of vertex oriented programs has a corresponding sequential executionto do this, it prevents adjacent vertex programs from running concurrently by using a fine grained locking protocol that requires sequentially grabbing locks on all neighbouring vertices[6]. 4.2 PageRank Implementation in GraphLab 1)Data Graph Every vertex (v) is related to a webpage Every edge (u,v) is related to a link (u v) Vertex data D(v) keeps the rank of the webpage R(v) Edge data Du v keeps the weight of the link (u v) 2) The update function The update function for PageRank assigns the computed weighted sum of the current ranks of neighbouring vertices as the rank of the current vertex. If the value of the current vertex changes more than a predefined threshold, only then the neighbours are listed for update making the algorithm adaptive. The update function for GraphLab takes as inputs a vertex and its scope v and Sv respectively. It returns as outputs the new version of the scope and a set of tasks T that is responsible for encoding future task executions. Update : (v,sv) (Sv,T ) Algorithm : PageRank update function Input: Vertex data R(v) from Sv Edge data {wu,v : u N[v]} from Sv Neighbour vertex data {R(u) : u N[v]} from Sv Algorithm: Rold(v) R(v) // old PageRank is saved R(v) α/n For each u N[v] do // Loop over neighbours 49 Dakshil Shah, Chetashri Bhadane, Aishwarya Sinh

R(v) R(v) + (1 α) wu,v R(u) /* If the PageRank changes sufficiently if R(v) Rold(v) > then schedule neighbors to be updated*/ return {(PageRankFun,u) : u N[v]} Output: Modified scope Sv with new R(v) 3) The sync operation The sync operation is defined as a tuple as follows:- (Key,Fold,Merge,Finalize,acc(0),τ) It consists of a unique key, three user defined functions, an initial accumulator value, and an integer representing the interval between sync operations. The Fold and Merge functions are used by the sync operation to perform a Global Synchronous Reduce. Fold aggregates vertex data. The intermediate Fold results are combined by Merge. A transformation on the final value is performed by the Finalize function and the result is stored. The Key is used by update functions to access the most recent result of the sync operation Algorithm : GraphLab Execution Model Input: Data Graph G = (V,E,D) Input: Initial task set T = {(f,v1),(g,v2),...} Initial set of syncs:(name,fold,merge,finalize,acc(0),τ) Algorithm: while T is not Empty do (f,v) RemoveNext(T )1 (T 0,Sv) f(v,sv)2 T T T 03 Run all Sync operations which are ready Output: Modified Data Graph G = (V,E,D0) Result of Sync operations 5. PREGEL 5.1 Introduction to Pregel Pregel[7] is a scalable infrastructure developed by Google to mine data contained in a variety of 50 Dakshil Shah, Chetashri Bhadane, Aishwarya Sinh graphs. In order to solve graph-based problems, programs are developed as a sequence of iterations. In every iteration, a vertex is capable of receiving messages sent to it in the previous iteration, send messages to other vertices independent of other vertices. The vertices may also modify its edge states and alter the graphs topology. Pregel programs can be scaled automatically on a cluster [8]. Vertices carry out the graph computation, while the edges are responsible for communication of these computed results between vertices. Edges do not participate in computation [9]. The input is a directed graph. The input is initialized, followed by a sequence of supersteps and terminates at the end of the algorithm. In each superstep, the same user-defined function for the given algorithm executes in parallel, with computation being done in the vertices. The termination of the algorithm depends upon every vertex voting to halt [8]. When a vertex votes to halt, it deactivates itself in the process. The framework in subsequent iterations will not execute this vertex unless reactivated by a message. When all the vertices are in an inactive state simultaneously, the algorithm terminates. 5.2 PageRank Implementation in Pregel Algorithm[10] class PageRank function ComputeAtVertex(Message msg) if(superstep() >= 1) sum=0 while(!msg->done) sum+=msg->value() msg->next() mutablevalue()=0.15/numvertces() +dampingfactor*sum if(superstep()>maxset) VoteToHalt() else n=getoutedgeiterator().size() SendMessageToAllNeighbours(GetValue()/n) The vertices store the intermediate PageRank value which is computed. On crossing the maximum number of set supersteps, no further messages are sent and the vertices vote to halt. The algorithm is generally run until a convergence is achieved.

6. COMPARISON Most advanced machine learning and data mining algorithms focus on modelling dependencies between data. But data parallel abstractions like MapReduce fails when there are computational dependencies in data. For expressing such computational dependencies, graph-parallel abstractions like GraphLab and Pregel simplify the design and implementation of graph-parallel algorithms. This is achieved by freeing the user to focus on sequential computation rather than parallel movement of data. While the MapReduce abstraction can be run iteratively, there is no mechanism provided to directly carry out iterative computation. As a result, it is not possible to express sophisticated scheduling, automatically assess termination, or leverage basic data persistence. Unlike MapReduce, GraphLab can provide sophisticated scheduling primitives that can express iterative parallel algorithms with dynamic scheduling. Table 1. Comparison MapReduce GraphLab Pregel Messages Yes Eliminated Yes File System Hadoop Distributed File System Shared Data Table Google File System Programming model Parallelism Architecture model Task/Vertex Scheduling model Application Suitability Computational Model Synchronous Asynchronous Synchronous Sparse dependency No Yes Yes Iterative With help Yes Yes of extensions [5] Sharedmemory Sharedmemory Messagepassing Data- Graph- Graph- Parallel Parallel Parallel Master- Peer-to-Peer Master- Slave Slave Pull-based Push-based Push-based Loosely- Connected Strongly- Connected Applications Strongly- Connected Applications 51 Dakshil Shah, Chetashri Bhadane, Aishwarya Sinh GraphLab gives a faster and more efficient runtime performance than MapReduce and eliminates messages. The Update Function in GraphLab is analogous to the Map in MapReduce, but unlike in MapReduce, update functions are permitted to access and modify overlapping contexts in the graph. The sync mechanism is analogous to the Reduce operation, but unlike in MapReduce, the sync mechanism runs concurrently with the update functions.[5] Pregel is more effective in handling iterative processing in comparison with MapReduce[10]. Pregel keeps vertices & edges on the machine that performs computation, uses the network to transfers only messages whereas MapReduce passes the entire state of the graph from one stage to the next, and needs to coordinate the steps of a chained MapReduce. 7. CONCLUSION Both MapReduce and Pregel are synchronous computation models while GraphLab is asynchronous. GraphLab and Pregel both support iterative algorithms and are graph parallel. MapReduce does support iterative algorithms with the help of extensions and is data parallel. In this paper we reviewed three frameworks with respect to graph based problems by comparing their mechanisms for solving the same, with GraphLab and Pregel more suited for graph based problems in comparison to MapReduce. 8. REFERENCES [1] A. Rajaraman and J. Ullman, Mining of massive datasets. New York, N.Y.: Cambridge University Press, 2012. [2] J. Alpert and N. Hajaj, 'We knew the web was big...', Google Blog, 2008. [Online]. Available: [2] http://googleblog.blogspot.in/2008/07/we-knewweb-was-big.html. [Accessed: 08- Sep- 2015]. [3] Dong, Xicheng, Ying Wang, and Huaming Liao. "Scheduling mixed real-time and non-real-time applications in mapreduce environment." Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on. IEEE, 2011. [4] Lin, Jimmy, and Chris Dyer. "Data-intensive text processing with MapReduce."Synthesis Lectures on Human Language Technologies 3.1 (2010): 1-177. [5] S. Sakr, 'Processing large-scale graph data: A guide to current technology', Ibm.com, 2013. [Online]. Available:

[5]http://www.ibm.com/developerworks/library/osgiraph/. [Accessed: 16- Sep- 2015]. [6] Hindman, Benjamin, et al. "Nexus: A common substrate for cluster computing."workshop on Hot Topics in Cloud Computing. 2009. [7] Low, Yucheng, et al. "Graphlab: A new framework for parallel machine learning." arxiv preprint arxiv:1408.2041 (2014).. [8] G. Czajkowski, 'Large-scale graph computing at Google', Google Research Blog, 2009. [Online]. Available: [8] http://google research.blogspot.in /2009/06/large-scale-graph-computing-atgoogle.html. [Accessed: 15- Sep- 2015]. [9] Malewicz, Grzegorz, et al. "Pregel: a system for large-scale graph processing."proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 2010. [10] Han, Minyang, et al. "An experimental comparison of pregel-like graph processing systems." Proceedings of the VLDB Endowment 7.12 (2014): 1047-1058. [11] Jiang, Dawei, et al. "epic: an extensible and scalable system for processing big data." Proceedings of the VLDB Endowment 7.7 (2014): 541-552. [12] 'The GraphLab Abstraction', 2015. 52 Dakshil Shah, Chetashri Bhadane, Aishwarya Sinh