Frameworks for Graph-Based Problems

Dakshil Shah, U.G. Student, Computer Engineering Department, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India
Chetashri Bhadane, Assistant Professor, Computer Engineering Department, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India
Aishwarya Sinh, U.G. Student, Computer Engineering Department, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India

ABSTRACT
There has been tremendous growth of graph-structured data in modern applications. Social networks and knowledge bases are two such applications, and they create a crucial need for architectures that can process dense graphs. MapReduce is a programming model used in large-scale data-parallel applications. Despite its notable role in big data analytics, MapReduce is not suited to large-scale graph processing. Owing to these limitations of MapReduce, GraphLab is used as an alternative for large, dense graph processing applications. In this paper, the PageRank algorithm is explained with respect to both implementations. The two are then compared, and a conclusion is drawn based on the capabilities of each method in solving graph-based problems.

Keywords
Big Data, MapReduce, PageRank, GraphLab, Graph Based Problems, Pregel

1. INTRODUCTION
Graphs are an abstract way of representing connectivity using nodes and links, also known as vertices and edges. Edges can be one-directional or bi-directional. Nodes and edges may carry auxiliary information. Many problems are formulated and solved in terms of graphs. Some graph-based problems are: shortest path problems, network flow problems, matching problems, the 2-SAT problem, the graph coloring problem, the Traveling Salesman Problem (TSP), and PageRank.

2. PAGERANK
The Web can be considered as a directed graph. The nodes are the pages, and there is an arc from page 1 to page 2 if there are one or more links from 1 to 2 [1]. PageRank is a probability distribution over nodes in the graph representing the likelihood that a random walk over the link structure will arrive at a particular node. Nodes with high in-degrees tend to have a high PageRank. PageRank is defined recursively:

P(n) = α (1/|G|) + (1 − α) Σ_{m ∈ L(n)} P(m) / C(m)

where P(n) is the PageRank of a page n, |G| is the total number of pages in the graph, α is the random jump factor, L(n) is the set of pages that link to n, and C(m) is the out-degree of page m. The recursive definition leads to an iterative algorithm, which is similar in structure to a parallel breadth-first search. There are a large number of webpages in existence today on the Internet; Google has estimated the number of unique web URLs to be over a trillion [2]. Calculating the PageRank of this collection needs a Big Data approach.

3. MAPREDUCE
3.1 Introduction to MapReduce
Google introduced MapReduce, and Apache has implemented its own version as part of Apache Hadoop. Thousands of commodity machines in clusters are used to process petabytes of data, which makes the model suitable for programs that can be decomposed into parallel tasks. It consists of an execution runtime and the Hadoop Distributed File System (HDFS). The execution runtime handles partitioning the input data, scheduling the program's execution, managing inter-machine communication, and dealing with machine failures. There are two primary methods: map and reduce. A job is partitioned into multiple tasks, which are then parallelized for execution on a cluster. In the map stage, every task fetches data from HDFS and splits it into records using the RecordReader. This is then
processed by a user-defined map() function. The results are stored in temporary files containing many data partitions, which will be processed by the reduce tasks. In the reduce stage, every task goes through the following three phases [3]:
1) The data partitions are copied from remote map tasks.
2) The data partitions are sorted.
3) The sorted data partitions are further processed using a user-defined reduce() function.
The results are then written back to disk. The user sets the maximum number of parallel tasks in a configuration file. Slots can be considered as tokens, and only tasks that get a slot are allowed to run. This slot distribution is overseen by the scheduler, the default scheduler being First In First Out (FIFO).

3.2 PageRank Implementation in MapReduce
The PageRank algorithm is simplified by ignoring the random jump factor and assuming no dangling nodes. Pseudocode for the MapReduce implementation:

class Mapper
  method Map(nid n, node N)
    p ← N.PageRank / |N.AdjacencyList|
    Emit(nid n, N)                       // Pass along graph structure
    for all nodeid m ∈ N.AdjacencyList do
      Emit(nid m, p)                     // Pass PageRank mass to neighbours

class Reducer
  method Reduce(nid m, [p1, p2, ...])
    M ← ∅; s ← 0
    for all p ∈ [p1, p2, ...] do
      if IsNode(p) then
        M ← p                            // Recover graph structure
      else
        s ← s + p                        // Sum incoming PageRank contributions
    M.PageRank ← s
    Emit(nid m, node M)

In the map phase, each node's PageRank mass is divided evenly, and each piece is passed along the outgoing edges to the neighbours. In the reduce phase, the PageRank contributions are summed up at each destination node. Each MapReduce job corresponds to one iteration of the above algorithm.

Figure 1: Sample graph [4]
Figure 2: After one iteration on MapReduce [4]

The size of each box in Figure 2 is in proportion to its PageRank value. The PageRank mass is evenly distributed to each node's adjacency list. In the reduce phase, all partial PageRank contributions are summed together to get the updated values.
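As a sketch of the scheme above, the following single-machine Python simulation runs the simplified algorithm (no random jump factor, no dangling nodes) on a small hypothetical five-node graph. The map phase emits the graph structure plus divided PageRank mass, an in-memory shuffle groups records by node id, and the reduce phase recovers the structure and sums the contributions; each pass over the loop stands in for one MapReduce job.

```python
from collections import defaultdict

# Node records are (adjacency_list, pagerank), keyed by node id.
def map_phase(nid, node):
    adjacency, rank = node
    yield nid, node                     # pass along the graph structure
    for m in adjacency:                 # pass PageRank mass to neighbours
        yield m, rank / len(adjacency)

def reduce_phase(nid, values):
    s, M = 0.0, None
    for p in values:
        if isinstance(p, tuple):        # recover the graph structure
            M = p
        else:
            s += p                      # sum incoming contributions
    return nid, (M[0], s)

# Hypothetical five-node graph, uniform initial ranks.
graph = {"n1": (["n2", "n4"], 0.2), "n2": (["n3", "n5"], 0.2),
         "n3": (["n4"], 0.2),       "n4": (["n5"], 0.2),
         "n5": (["n1", "n2", "n3"], 0.2)}

for _ in range(30):                     # each iteration = one MapReduce job
    shuffle = defaultdict(list)
    for nid, node in graph.items():
        for key, value in map_phase(nid, node):
            shuffle[key].append(value)
    graph = dict(reduce_phase(k, v) for k, v in shuffle.items())
```

Because the random jump is dropped and no mass is lost at dangling nodes, the total PageRank mass stays at 1.0 across iterations, matching the mass-conservation intuition in Figure 2.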
The graph structure seen in Figure 1 is passed along from one iteration to the next.

3.3 Issues
The MapReduce model provides no built-in mechanism for communicating global state. Because the graph is represented as per-node adjacency information, inter-node communication can occur only via direct links or intermediate nodes; information can be passed only within the local graph structure. Local computation is carried out at each node and passed on to the neighbouring nodes, so convergence on the global graph is possible only after multiple iterations. The shuffling and sorting steps of MapReduce carry out the passing of these partial results. The amount of intermediate data generated is on the order of the number of edges. For a dense graph, run time is wasted copying this intermediate data across the network, which may lead to a run time of O(n²) in the worst case, where n is the number of nodes. Thus, MapReduce algorithms are not feasible on large, dense graphs. Given the large number of graph-based applications, such as social networks, a more efficient approach is needed.
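The intermediate-data growth described above can be made concrete with a small count: in the worst case of a complete directed graph on n nodes, each map phase emits one structure record per node plus one mass record per edge, so the shuffle moves n + n(n−1) records per iteration. A quick illustration (the sizes are hypothetical):

```python
# Records one map phase shuffles for a complete directed graph on n nodes:
# n structure records (Emit(nid n, N)) plus n*(n-1) mass records, one per edge.
def map_emissions(n):
    structure = n
    mass = n * (n - 1)
    return structure + mass

small = map_emissions(10)      # 10 + 90 = 100 records
large = map_emissions(1000)    # roughly 10^6 records per iteration
```

The quadratic term dominates quickly, which is why the copy cost across the network becomes the bottleneck on dense graphs.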
4. GRAPHLAB
4.1 Introduction to GraphLab
MapReduce is unable to express statistical inference algorithms efficiently. GraphLab compactly expresses asynchronous algorithms with sparse computational dependencies while ensuring data consistency and a high degree of parallelism [5]. GraphLab can express complex computational dependencies using a data graph, and it can express iterative parallel algorithms with dynamic scheduling. The data model consists of two parts: a directed data graph and a shared data table. The graph is represented as G = (V, E). The data associated with vertex v is denoted by D_v, and the data associated with edge (u→v) by D_(u→v). The shared data table (SDT) is an associative map, T[Key] → Value, between keys and arbitrary blocks of data. Computation in GraphLab is performed through an update function, which defines local computation, or through the sync mechanism, which defines global aggregation. The update function is analogous to Map in MapReduce, but unlike the map function, update functions are permitted to access and modify overlapping contexts in the graph. The sync mechanism is analogous to the Reduce operation but, unlike in MapReduce, runs concurrently with the update functions. The GraphLab runtime engine determines the best order in which to run vertices by relaxing the execution-order requirements of the shared memory, thus enabling efficient distributed execution; however, the restriction that all vertices must eventually be run is imposed. GraphLab eliminates messages and isolates the user-defined algorithm from data movement, so the system can choose how and when to move program state. By allowing mutable data to be associated with both vertices and edges, it enables the algorithm designer to distinguish between data shared with all neighbours and data shared with a particular neighbour. GraphLab does not differentiate between edge directions.
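The two-part data model just described can be sketched in a few lines of Python. The class and field names below (DataGraph, vertex_data, shared_data_table) are illustrative stand-ins, not GraphLab's actual API:

```python
# Minimal sketch of the GraphLab data model: a directed data graph with
# mutable data on vertices (D_v) and edges (D_(u->v)), plus a shared
# data table mapping keys to arbitrary blocks of data.
class DataGraph:
    def __init__(self):
        self.vertex_data = {}           # D_v: data attached to vertex v
        self.edge_data = {}             # D_(u->v): data attached to edge u->v
        self.out_edges = {}             # adjacency: u -> [v, ...]

    def add_vertex(self, v, data):
        self.vertex_data[v] = data
        self.out_edges.setdefault(v, [])

    def add_edge(self, u, v, data):
        self.edge_data[(u, v)] = data
        self.out_edges.setdefault(u, []).append(v)

shared_data_table = {}                  # SDT: T[Key] -> Value

g = DataGraph()
g.add_vertex("u", {"rank": 1.0})
g.add_vertex("v", {"rank": 1.0})
g.add_edge("u", "v", {"weight": 1.0})
shared_data_table["num_vertices"] = len(g.vertex_data)
```

Update functions would read and write the vertex and edge dictionaries through a scope, while sync results would be published through the shared data table, mirroring the division of labour described above.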
The asynchronous execution behaviour depends on the number of machines and the availability of network resources. This leads to nondeterminism that can complicate algorithm design and debugging. The sequential model of the GraphLab abstraction is translated into parallel execution by allowing multiple processors to run the same loop on the same graph, removing and running different vertices simultaneously. GraphLab ensures that overlapping computation is not run simultaneously, so as to retain the semantics of sequential execution. To overcome the nondeterminism, GraphLab enforces serializability, so that every parallel execution of vertex-oriented programs has a corresponding sequential execution; to do this, it prevents adjacent vertex programs from running concurrently by using a fine-grained locking protocol that sequentially grabs locks on all neighbouring vertices [6].

4.2 PageRank Implementation in GraphLab
1) Data graph
Every vertex v corresponds to a webpage. Every edge (u,v) corresponds to a link u→v. The vertex data D(v) stores the rank of the webpage, R(v). The edge data D(u→v) stores the weight of the link u→v.

2) The update function
The update function for PageRank assigns the computed weighted sum of the current ranks of the neighbouring vertices as the rank of the current vertex. Only if the value of the current vertex changes by more than a predefined threshold are the neighbours listed for update, making the algorithm adaptive. The update function takes as input a vertex v and its scope S_v, and returns the new version of the scope together with a set of tasks T that encodes future task executions:

Update : (v, S_v) → (S_v, T)

Algorithm: PageRank update function
Input: vertex data R(v) from S_v; edge data {w(u,v) : u ∈ N[v]} from S_v; neighbour vertex data {R(u) : u ∈ N[v]} from S_v

Rold(v) ← R(v)                  // old PageRank is saved
R(v) ← α/n
for each u ∈ N[v] do            // loop over neighbours
  R(v) ← R(v) + (1 − α) · w(u,v) · R(u)
if |R(v) − Rold(v)| > ε then    // if the PageRank changes sufficiently,
                                // schedule the neighbours to be updated
  return {(PageRankFun, u) : u ∈ N[v]}
Output: modified scope S_v with the new R(v)

3) The sync operation
The sync operation is defined as a tuple:

(Key, Fold, Merge, Finalize, acc(0), τ)

It consists of a unique key, three user-defined functions, an initial accumulator value, and an integer τ representing the interval between sync operations. The Fold and Merge functions are used by the sync operation to perform a global synchronous reduce: Fold aggregates vertex data, and the intermediate Fold results are combined by Merge. A transformation on the final value is performed by the Finalize function, and the result is stored. The Key is used by update functions to access the most recent result of the sync operation.

Algorithm: GraphLab execution model
Input: data graph G = (V, E, D); initial task set T = {(f, v1), (g, v2), ...}; initial set of syncs (Name, Fold, Merge, Finalize, acc(0), τ)

while T is not empty do
  (f, v) ← RemoveNext(T)
  (T', S_v) ← f(v, S_v)
  T ← T ∪ T'
  Run all sync operations which are ready
Output: modified data graph G = (V, E, D'); result of sync operations

5. PREGEL
5.1 Introduction to Pregel
Pregel [7] is a scalable infrastructure developed by Google to mine data contained in a variety of graphs. To solve graph-based problems, programs are developed as a sequence of iterations. In every iteration, a vertex can receive messages sent to it in the previous iteration and send messages to other vertices, independently of other vertices. A vertex may also modify its edge states and alter the graph's topology. Pregel programs scale automatically on a cluster [8]. Vertices carry out the graph computation, while the edges are responsible for communicating these computed results between vertices; edges do not participate in computation [9].
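The adaptive update function and the task-queue execution model above can be sketched together in plain Python. The scope is modelled as shared dictionaries, and the names (pagerank_update, the two-page demo graph, ε = 1e-4) are illustrative rather than GraphLab's actual API:

```python
ALPHA, EPS = 0.15, 1e-4

def pagerank_update(v, scope, n):
    """One GraphLab-style update; scope = (R, w, in_nbrs, out_nbrs)."""
    R, w, in_nbrs, out_nbrs = scope
    r_old = R[v]                          # old PageRank is saved
    r_new = ALPHA / n
    for u in in_nbrs[v]:                  # loop over linking neighbours
        r_new += (1 - ALPHA) * w[(u, v)] * R[u]
    R[v] = r_new
    if abs(r_new - r_old) > EPS:          # adaptive: reschedule neighbours
        return [(pagerank_update, u) for u in out_nbrs[v]]
    return []                             # converged locally: no new tasks

# Two-page demo: each page links to the other with weight 1.
R = {"a": 0.8, "b": 0.2}
w = {("a", "b"): 1.0, ("b", "a"): 1.0}
nbrs = {"a": ["b"], "b": ["a"]}
scope = (R, w, nbrs, nbrs)

tasks = [(pagerank_update, "a"), (pagerank_update, "b")]
while tasks:                              # serial stand-in for the engine
    f, v = tasks.pop(0)
    tasks += f(v, scope, 2)
```

The while loop mirrors the execution model's RemoveNext/T ← T ∪ T' cycle: work stops on its own once no vertex changes by more than the threshold, which is exactly the adaptivity the abstraction provides.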
The input is a directed graph. The input is initialized, followed by a sequence of supersteps, and the computation terminates at the end of the algorithm. In each superstep, the same user-defined function for the given algorithm executes in parallel, with computation being done at the vertices. Termination of the algorithm depends upon every vertex voting to halt [8]. When a vertex votes to halt, it deactivates itself; the framework will not execute this vertex in subsequent supersteps unless it is reactivated by a message. When all the vertices are in an inactive state simultaneously, the algorithm terminates.

5.2 PageRank Implementation in Pregel
Algorithm [10]:

class PageRank
  function ComputeAtVertex(Message msg)
    if superstep() >= 1 then
      sum ← 0
      while !msg->done() do
        sum ← sum + msg->value()
        msg->next()
      mutablevalue() ← 0.15 / numvertices() + dampingfactor · sum
    if superstep() > maxset then
      VoteToHalt()
    else
      n ← getoutedgeiterator().size()
      SendMessageToAllNeighbours(GetValue() / n)

The vertices store the intermediate PageRank value that is computed. Once the maximum number of supersteps is crossed, no further messages are sent and the vertices vote to halt. The algorithm is generally run until convergence is achieved.
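The superstep cycle above can be sketched as a bulk-synchronous simulation in Python: each round delivers the previous round's messages, every active vertex computes and sends, and halted vertices stay idle unless a message reactivates them. The three-vertex ring graph and the constants are hypothetical choices for illustration:

```python
MAX_SUPERSTEPS = 30

def run_pregel(out_edges):
    """BSP simulation of the Pregel-style PageRank compute function."""
    n = len(out_edges)
    value = {v: 1.0 / n for v in out_edges}
    inbox = {v: [] for v in out_edges}
    active = set(out_edges)               # all vertices start active
    for superstep in range(MAX_SUPERSTEPS + 1):
        outbox = {v: [] for v in out_edges}
        for v in sorted(active):
            if superstep >= 1:            # combine last round's messages
                value[v] = 0.15 / n + 0.85 * sum(inbox[v])
            if superstep >= MAX_SUPERSTEPS:
                active.discard(v)         # VoteToHalt()
            else:                         # send rank share along out-edges
                share = value[v] / len(out_edges[v])
                for dst in out_edges[v]:
                    outbox[dst].append(share)
        inbox = outbox
        # a pending message reactivates a halted vertex
        active |= {v for v, msgs in inbox.items() if msgs}
        if not active:                    # all vertices inactive: terminate
            break
    return value

ranks = run_pregel({"a": ["b"], "b": ["c"], "c": ["a"]})
```

Only the messages cross the superstep boundary; the vertex values stay where they are computed, which is the locality property the comparison in the next section credits Pregel with.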
6. COMPARISON
Most advanced machine learning and data mining algorithms focus on modelling dependencies between data, but data-parallel abstractions like MapReduce fail when there are computational dependencies in the data. For expressing such computational dependencies, graph-parallel abstractions like GraphLab and Pregel simplify the design and implementation of graph-parallel algorithms. This is achieved by freeing the user to focus on sequential computation rather than the parallel movement of data. While the MapReduce abstraction can be run iteratively, no mechanism is provided to directly carry out iterative computation. As a result, it is not possible to express sophisticated scheduling, automatically assess termination, or leverage basic data persistence. Unlike MapReduce, GraphLab provides sophisticated scheduling primitives that can express iterative parallel algorithms with dynamic scheduling.

Table 1. Comparison

                            MapReduce                GraphLab              Pregel
Messages                    Yes                      Eliminated            Yes
File System                 Hadoop Distributed       Shared Data Table     Google File System
                            File System
Computational Model         Synchronous              Asynchronous          Synchronous
Sparse dependency           No                       Yes                   Yes
Iterative                   With help of             Yes                   Yes
                            extensions [5]
Programming model           Shared memory            Shared memory         Message passing
Parallelism                 Data-parallel            Graph-parallel        Graph-parallel
Architecture model          Master-slave             Peer-to-peer          Master-slave
Task/vertex scheduling      Pull-based               Push-based            Push-based
Application suitability     Loosely-connected        Strongly-connected    Strongly-connected
                            applications             applications          applications

GraphLab gives faster and more efficient runtime performance than MapReduce and eliminates messages. The update function in GraphLab is analogous to Map in MapReduce, but unlike in MapReduce, update functions are permitted to access and modify overlapping contexts in the graph.
The sync mechanism is analogous to the Reduce operation but, unlike in MapReduce, runs concurrently with the update functions [5]. Pregel is more effective at handling iterative processing than MapReduce [10]: Pregel keeps vertices and edges on the machine that performs the computation and uses the network only to transfer messages, whereas MapReduce passes the entire state of the graph from one stage to the next and needs to coordinate the steps of a chained MapReduce.

7. CONCLUSION
Both MapReduce and Pregel are synchronous computation models, while GraphLab is asynchronous. GraphLab and Pregel both support iterative algorithms and are graph-parallel; MapReduce supports iterative algorithms only with the help of extensions and is data-parallel. In this paper we reviewed three frameworks with respect to graph-based problems by comparing their mechanisms for solving them, with GraphLab and Pregel more suited to graph-based problems than MapReduce.

8. REFERENCES
[1] A. Rajaraman and J. Ullman, Mining of Massive Datasets. New York, N.Y.: Cambridge University Press, 2012.
[2] J. Alpert and N. Hajaj, 'We knew the web was big...', Google Blog, 2008. [Online]. Available: http://googleblog.blogspot.in/2008/07/we-knew-web-was-big.html. [Accessed: 08-Sep-2015].
[3] Dong, Xicheng, Ying Wang, and Huaming Liao. "Scheduling mixed real-time and non-real-time applications in MapReduce environment." Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on. IEEE, 2011.
[4] Lin, Jimmy, and Chris Dyer. "Data-intensive text processing with MapReduce." Synthesis Lectures on Human Language Technologies 3.1 (2010): 1-177.
[5] S. Sakr, 'Processing large-scale graph data: A guide to current technology', Ibm.com, 2013. [Online]. Available:
http://www.ibm.com/developerworks/library/os-giraph/. [Accessed: 16-Sep-2015].
[6] Hindman, Benjamin, et al. "Nexus: A common substrate for cluster computing." Workshop on Hot Topics in Cloud Computing. 2009.
[7] Low, Yucheng, et al. "GraphLab: A new framework for parallel machine learning." arXiv preprint arXiv:1408.2041 (2014).
[8] G. Czajkowski, 'Large-scale graph computing at Google', Google Research Blog, 2009. [Online]. Available: http://googleresearch.blogspot.in/2009/06/large-scale-graph-computing-at-google.html. [Accessed: 15-Sep-2015].
[9] Malewicz, Grzegorz, et al. "Pregel: a system for large-scale graph processing." Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010.
[10] Han, Minyang, et al. "An experimental comparison of Pregel-like graph processing systems." Proceedings of the VLDB Endowment 7.12 (2014): 1047-1058.
[11] Jiang, Dawei, et al. "epiC: an extensible and scalable system for processing big data." Proceedings of the VLDB Endowment 7.7 (2014): 541-552.
[12] 'The GraphLab Abstraction', 2015.