Scalable Big Graph Processing in MapReduce


Lu Qin, Jeffrey Xu Yu, Lijun Chang, Hong Cheng, Chengqi Zhang, Xuemin Lin. Presented by Megan Bryant, College of William and Mary, February 11, 2015.

Overview. In this presentation, we introduce methods for scalable big graph processing in MapReduce. Specifically, we introduce a new class, SGC, which can guide the development of scalable graph processing algorithms in MapReduce, along with two new graph join operators that greatly enhance the capabilities of the SGC class. Finally, we compare the performance of these classes on several scalable graph algorithms.

Computational Complexity. Computational complexity theory provides a framework and a set of analysis tools for gauging the work performed by an algorithm, measured by the elementary (i.e., basic) operations it performs. The basic steps (operations) an algorithm typically takes are: assignment (e.g., assigning a value to a variable), arithmetic (e.g., addition, subtraction, multiplication, and division), and logical operations (e.g., comparison of two numbers).

Big-O Notation. We use Big-O notation to describe the complexity of an algorithm. Definition: an algorithm is said to run in O(f(n)) time if, for some constant c and some n0, the time taken by the algorithm is at most c·f(n) for all n ≥ n0. This is an example of worst-case analysis, which is independent of the computing environment, relatively easy to perform, and provides an upper bound on the running time of the algorithm.

Big-O Complexity. [Figure: growth rates of common complexity classes.]

Common Complexities. The following table contains the complexities of common algorithms.

Algorithm                  Data Structure                        Time Complexity   Space Complexity
Depth First Search         Graph with n nodes and m edges        O(n + m)          O(m)
Breadth First Search       Graph with n nodes and m edges        O(n + m)          O(m)
Binary Search              Sorted array                          O(log n)          O(1)
Dijkstra's Shortest Path   Graph with n nodes (unsorted array)   O(n^2)            O(n)

Algorithm Classes in MapReduce. There are currently two main algorithm classes in the MapReduce paradigm: the MapReduce Class (MRC) and the Minimal MapReduce Class (MMC). These classes are defined in terms of disk usage, memory usage, communication cost, CPU cost, and the number of MapReduce rounds. There is also the popular Parallel Random-Access Machine (PRAM) model, against which performance studies were run.

MapReduce Class. Let S be the set of objects in the problem and let t be the number of machines in the system. Fix an ε > 0; a MapReduce algorithm in MRC should have the following properties:

                   Each Machine               Total
Disk:              O(|S|^(1-ε))               O(|S|^(2-2ε))
Memory:            O(|S|^(1-ε))               O(|S|^(2-2ε))
Communication:     O(|S|^(1-ε)) per round     O(|S|^(2-2ε))
CPU:               O(poly(|S|)) per round
Number of Rounds:  O(log^i |S|), i ≥ 0

Minimal MapReduce Class. Let S be the set of objects in the problem and let t be the number of machines in the system. A MapReduce algorithm in MMC should have the following properties:

                   Each Machine           Total
Disk:              O(|S|/t)               O(|S|)
Memory:            O(|S|/t)               O(|S|)
Communication:     O(|S|/t) per round     O(|S|)
CPU:               O(T_seq/t)
Number of Rounds:  O(1)

T_seq is the time to solve the same problem on a single sequential machine.

Parallel Random Access Machine. The Parallel Random Access Machine (PRAM) is a model of parallel computation, extending the RAM model of sequential computation. In this model, p processors are connected to a single shared memory, and each processor has a unique index 1 ≤ i ≤ p called the processor id. A single program is executed in single-instruction-stream, multiple-data-stream fashion, meaning that each instruction is carried out by all processors simultaneously and requires unit time, regardless of the number of processors. Finally, each processor has a private flag that controls whether it is active in the execution of an instruction; inactive processors do not participate in the execution of instructions, except for instructions that reset the flag. We will later compare the performance of this model against MRC, MMC, and SGC.

MRC vs. MMC. MRC defines the basic requirements for an algorithm to execute in MapReduce, whereas MMC requires a MapReduce algorithm to achieve optimality in several aspects simultaneously. We begin by analyzing the problems that arise with MRC and MMC in graph processing.

Defining a Graph. Consider a graph G = (V, E), where V is the set of vertices (nodes) and E is the set of edges (arcs). Let n = |V| be the number of nodes and m = |E| the number of edges. A graph can be directed or undirected, cyclic or acyclic, connected or unconnected. We can represent a graph with either an adjacency matrix or an adjacency list.

Adjacency Matrix. [Figure: adjacency-matrix representation of a graph.]

Adjacency List. [Figure: adjacency-list representation of a graph.]

Scalable Graph Processing in MMC. For a graph G(V, E), a common graph operation is to exchange data among all adjacent nodes (nodes that share a common edge). The memory constraint in MMC requires that all edges/nodes be distributed evenly among all machines in the system. This can be formalized as follows: let E_{i,j} be the set of edges (u, v) in G such that u is on machine i and v is on machine j.

Scalable Graph Processing in MMC. The communication constraint in MMC can be formalized as:

max_{1 ≤ i ≤ t} ( Σ_{1 ≤ j ≤ t, j ≠ i} |E_{i,j}| ) ≤ O((n + m)/t)

where, as before, E_{i,j} is the set of edges (u, v) ∈ E such that u is on machine i and v is on machine j. To achieve this inequality, we must minimize the maximum, i.e.,

min max_{1 ≤ i ≤ t} ( Σ_{1 ≤ j ≤ t, j ≠ i} |E_{i,j}| ).

However, this minimization problem is NP-hard, meaning that it is at least as hard as the hardest problems in NP.
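As a concrete illustration, the quantity being minimized can be computed directly for a given assignment of nodes to machines. The following sketch is not from the paper (the graph and placement are made-up examples): it counts, for each machine, the edges whose other endpoint lies on a different machine, and returns the maximum over machines.

```python
# Sketch: given an assignment of nodes to t machines, compute
# max_i sum_{j != i} |E_{i,j}| -- the per-machine cross-machine
# communication that the MMC constraint asks to bound by O((n+m)/t).

def max_cross_communication(edges, machine_of, t):
    """edges: list of (u, v) pairs; machine_of: dict node -> machine id."""
    cross = [0] * t  # cross[i] = number of edges leaving machine i
    for u, v in edges:
        if machine_of[u] != machine_of[v]:
            cross[machine_of[u]] += 1
    return max(cross)

# Example: a 4-cycle split across 2 machines; only (1,2) and (3,0) cross.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
placement = {0: 0, 1: 0, 2: 1, 3: 1}   # nodes 0,1 on machine 0; 2,3 on machine 1
print(max_cross_communication(edges, placement, 2))  # -> 1
```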

Scalable Graph Processing in MMC. In addition to the problem being NP-hard, even if the optimal distribution is successfully computed, we cannot guarantee that the bound O((n + m)/t) holds, since the maximum may still be as large as O(n + m). Therefore, MMC is not a suitable class for scalable graph processing.

Scalable Graph Processing in MRC. MRC has fewer constraints than MMC, as it simply defines the basic conditions that a MapReduce algorithm should satisfy, and a graph algorithm in MapReduce is no exception. As with MMC, however, we can define a better class to handle scalable graph processing. Given a graph G(V, E) with n nodes and m edges, and assuming m ≥ n^(1+c), we can define a class based on MRC for graph processing in MapReduce, in which a MapReduce algorithm has the following properties:

                   Each Machine                Total
Disk:              O(n^(1+c/2))                O(m^(1+c/2))
Memory:            O(n^(1+c/2))                O(m^(1+c/2))
Communication:     O(n^(1+c/2)) per round      O(m^(1+c/2))
CPU:               O(poly(m)) per round
Number of Rounds:  O(1)

Scalable Graph Processing in MRC. This class has the good property that the algorithm runs in a constant number of rounds. However, the memory constraint can cause difficulty, as it is large even for a dense graph. (Note: dense graphs are generally easier to solve than sparse graphs.) Furthermore, if the memory of each machine cannot hold O(n^(1+c/2)), the algorithm will always fail. Thus, the class is not scalable and cannot handle large n.

Scalable Graph Processing Class. We now formulate a new algorithm class that counters this deficiency. First, we weaken the bound on the communication cost per machine from O((m + n)/t) to Õ(m/t, D(G, t)). This accounts for the fact that graphs, especially large graphs, can have a skewed degree distribution. This is seen in graphs such as social networks, which often have a few nodes of very high degree (subscribers, followers, etc.) alongside many lower-level users with only a few connections.

Skewed Degree Distribution. [Figure: example of a skewed degree distribution.]

Scalable Graph Processing Class. Suppose the nodes are uniformly distributed among all machines. Denote by V_i the set of nodes stored on machine i for 1 ≤ i ≤ t, and let d_j be the degree of node v_j in the input graph. Õ(m/t, D(G, t)) is defined as:

Õ(m/t, D(G, t)) = O( max_{1 ≤ i ≤ t} Σ_{v_j ∈ V_i} d_j )

D(G, t) = ((t - 1)/t²) · Σ_{v_j ∈ V} d_j²
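Both quantities, as reconstructed above, are simple functions of the degree sequence and the node placement. A minimal sketch (the machine assignment and degrees are made-up examples, not from the paper):

```python
# Sketch: the communication bound is the maximum total degree on any
# machine, and D(G, t) = (t-1)/t^2 * sum_j d_j^2 (formula as
# reconstructed from the slide).

def comm_bound(degrees_by_machine):
    """degrees_by_machine: list of lists, degrees of the nodes on each machine."""
    return max(sum(ds) for ds in degrees_by_machine)

def D(degrees, t):
    return (t - 1) / t**2 * sum(d * d for d in degrees)

machines = [[3, 1], [2, 2]]                    # t = 2 machines, two nodes each
degrees = [d for ds in machines for d in ds]
print(comm_bound(machines))                    # -> 4
print(D(degrees, 2))                           # (2-1)/4 * (9+1+4+4) -> 4.5
```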

Scalable Graph Processing Class. This leads to the following lemma, whose proof is omitted. Lemma 3.1: Let x_i (1 ≤ i ≤ t) be the communication cost upper bound for machine i, i.e., x_i = Σ_{v_j ∈ V_i} d_j. Then the expected value of x_i is E(x_i) = 2m/t, and the variance of x_i is Var(x_i) = D(G, t). The important point here is that the variance of the degree distribution of G, denoted Var(G), is

Var(G) = Σ_{v_j ∈ V} (d_j - 2m/n)² / n = (n · Σ_{v_j ∈ V} d_j² - 4m²) / n².

For fixed t, n, and m, minimizing D(G, t) is equivalent to minimizing Var(G). In other words, the variance of the communication cost across machines is minimized if all nodes in the graph have the same degree.

Scalable Graph Processing Class. Thus, we define the Scalable Graph Processing Class (SGC) as follows.

                   Each Machine                     Total
Disk:              O((m + n)/t)                     O(m + n)
Memory:            O(1)                             O(t)
Communication:     Õ(m/t, D(G, t)) per round        O(m + n)
CPU:               Õ(m/t, D(G, t)) per round        O(m + n)
Number of Rounds:  O(log n)

Comparison Between Classes. We examine the upper bounds of the three classes to see how SGC compares.

                         MRC              MMC            SGC
Disk/machine             O(n^(1+c/2))     O((n+m)/t)     O((n+m)/t)
Disk/total               O(m^(1+c/2))     O(n+m)         O(n+m)
Memory/machine           O(n^(1+c/2))     O((n+m)/t)     O(1)
Memory/total             O(m^(1+c/2))     O(n+m)         O(t)
Communication/machine    O(n^(1+c/2))     O((n+m)/t)     Õ(m/t, D(G,t))
Communication/total      O(m^(1+c/2))     O(n+m)         O(n+m)
CPU/machine              O(poly(m))       O(T_seq/t)     Õ(m/t, D(G,t))
CPU/total                O(poly(m))       O(T_seq)       O(n+m)
Number of rounds         O(1)             O(1)           O(log n)

Comparison Between Classes. Note that SGC requires each machine to use only constant memory, meaning that even if the total memory of the system is smaller than the input data, the algorithm can still be processed successfully. This is an even stronger constraint than the one defined in MMC. Given the constraints on memory, communication, and CPU, it is nearly impossible for a wide range of graph algorithms to be processed in a constant number of rounds in MapReduce. Thus, we relax the O(1) rounds defined in MMC to O(log n) rounds. Since Ω(log n) is the processing-time lower bound for a large number of parallel graph algorithms in the PRAM model, this is practical for the MapReduce framework, as evidenced by the experiments.


Graph Operators in SGC. In addition to the normal set of graph operators, such as union and intersection, two new graph operators are introduced in SGC, namely NE join and EN join, with which a wide range of graph algorithms can be designed.

Graph Operators in SGC. We assume that a graph G(V, E) is stored in a distributed file system as a node table V and an edge table E. Each node in the table has a unique id and possibly other information such as a label and keywords. Each edge in the table has (id1, id2), the source and target node ids of the edge, and possibly other information such as a weight and a label. We use the node id to represent the node when this is unambiguous. G can be either directed or undirected; for an undirected graph, each edge is stored as two edges (id1, id2) and (id2, id1).

Graph Operators in SGC. Before going further, let us examine the natural join operation, ⋈, acting on two sets of data. Here we see a graphical representation of Employee ⋈ Dept. [Figure omitted.]

NE Join. An NE join propagates information on nodes into edges. For each edge (v_i, v_j) ∈ E, an NE join outputs an edge (v_i, v_j, F(v_i)) (or (v_i, v_j, F(v_j))), where F(v_i) (or F(v_j)) is a set of functions operated on v_i (or v_j) in the node table V.

NE Join. Given a node table V_i and an edge table E_j, an NE join of V_i and E_j is represented in SQL as:

select id1, id2, f1(c1) as p1, f2(c2) as p2, ..., count_{cond'}(c') as cnt
from V_i as V NE join E_j as E on V.id = E.id
where cond(c)

with the following definitions:
c, c': subsets of fields in the two tables V_i and E_j.
c1, c2, ...: subsets of fields in the two tables V_i and E_j.
f_k: a function operated on the fields c_k.
cond: a function that returns true or false, defined on the fields in c.
cond': a function that returns true or false, defined on the fields in c'.
id: can be either id1 or id2.
count: counts the number of trues in cond'(c') and assigns it to cnt.
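The NE join's semantics can be sketched outside SQL as a plain join that attaches a function of the joined node's attributes to each edge. This is an illustrative sketch, not the paper's implementation; `ne_join`, the tables, and the attribute names are all hypothetical:

```python
# Sketch of an NE join: propagate a function f of each node's attributes
# onto its incident edges, joining the node table V with the edge table E
# on one of the edge's endpoint ids.

def ne_join(V, E, f, join_on=0):
    """V: dict node id -> attributes; E: list of (id1, id2) tuples.
    Returns edges extended with f applied to the joined node's attributes."""
    out = []
    for edge in E:
        node = V[edge[join_on]]
        out.append(edge + (f(node),))
    return out

V = {1: {"dist": 0}, 2: {"dist": 3}}
E = [(1, 2), (2, 1)]
# Attach each source node's distance to its outgoing edges.
print(ne_join(V, E, lambda v: v["dist"]))  # -> [(1, 2, 0), (2, 1, 3)]
```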

EN Join. An EN join aggregates information on edges into nodes. For each node v_i ∈ V, an EN join outputs a node (v_i, G(adj(v_i))), where adj(v_i) = {(v_i, v_j) ∈ E} and G is a set of decomposable aggregate functions on the edge set adj(v_i). An aggregate function g_k is decomposable if, for any dataset s and any two subsets s_1 and s_2 of s with s_1 ∩ s_2 = ∅ and s_1 ∪ s_2 = s, g_k(s) can be computed from g_k(s_1) and g_k(s_2).
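The decomposability requirement is easy to check concretely: sum and min can be recombined from any disjoint split of the data, while an aggregate like the median cannot. A minimal demonstration (the data is made up):

```python
# Check of the decomposability property: for a decomposable aggregate g,
# g(s) must be computable from g(s1) and g(s2) for any disjoint split
# s = s1 ∪ s2. sum and min qualify; median does not.
import statistics

s1, s2 = [1, 5, 9], [2, 4]
s = s1 + s2

assert sum(s) == sum([sum(s1), sum(s2)])   # sum is decomposable
assert min(s) == min(min(s1), min(s2))     # min is decomposable

# median(s) = 4, but it cannot in general be recovered from
# median(s1) = 5 and median(s2) = 3.0 alone.
print(statistics.median(s))                 # -> 4
```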

EN Join. An EN join can be defined in SQL form as:

select id, g1(c1) as p1, g2(c2) as p2, ..., count_{cond'}(c') as cnt
from V_i as V EN join E_j as E on V.id = E.id
where cond(c)
group by id

with the following definitions:
c, c': subsets of fields in the two tables V_i and E_j.
c1, c2, ...: subsets of fields in the two tables V_i and E_j.
id: either id1 or id2.
g_k: a decomposable aggregate function operated on the fields in c_k, grouping the results by node id.
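Analogously to the NE join sketch, the EN join's aggregation can be sketched as folding a decomposable function over each node's incident edges, grouped by node id (the "group by id" above). An illustrative sketch with hypothetical names, not the paper's implementation:

```python
# Sketch of an EN join: aggregate a decomposable function g over each
# node's incident edges, grouping by node id.
from collections import defaultdict

def en_join(E, g, init, join_on=0, value_at=2):
    """E: list of (id1, id2, value) tuples; fold each value into g
    per node, starting from the identity element init."""
    acc = defaultdict(lambda: init)
    for edge in E:
        nid = edge[join_on]
        acc[nid] = g(acc[nid], edge[value_at])
    return dict(acc)

E = [(1, 2, 5.0), (1, 3, 2.0), (2, 3, 7.0)]   # weighted edges
# Minimum incident edge weight per source node.
print(en_join(E, min, float("inf")))  # -> {1: 2.0, 2: 7.0}
```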

Basic Graph Algorithms. The combination of NE join and EN join can solve a wide range of graph problems in SGC. In this section, we introduce some basic graph algorithms: PageRank, Breadth First Search, and graph keyword search. We will use the MRC, MMC, and SGC versions of these algorithms for the performance testing covered later.

PageRank. PageRank is a key graph operation that computes the rank of each node based on the links (directed edges) among them. Given a directed graph G(V, E) and a page x with in-links t_1, ..., t_n, the PageRank of x can be calculated iteratively as:

PR(x) = α (1/|V|) + (1 - α) Σ_{i=1}^{n} PR(t_i)/C(t_i)

where C(t) is the out-degree of t, α is the probability of a random jump, and |V| is the total number of nodes.

PageRank Algorithm. [Figure: graphical overview of the PageRank algorithm.]

PageRank in MapReduce. [Figure: graphical overview of PageRank in MapReduce.]
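The PageRank iteration can be sketched sequentially as follows. This is a minimal sketch of the formula, not the MapReduce formulation; the graph, α, and iteration count are made-up example values, and dangling nodes (no out-links) are assumed absent:

```python
# Sketch of the PageRank iteration:
#   PR(x) = alpha * 1/|V| + (1 - alpha) * sum_i PR(t_i)/C(t_i)
# where t_i are x's in-links and C(t) is t's out-degree.

def pagerank(out_links, alpha=0.15, iters=50):
    """out_links: dict node -> list of linked-to nodes (no dangling nodes)."""
    nodes = list(out_links)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: alpha / n for v in nodes}
        for t, targets in out_links.items():
            share = (1 - alpha) * pr[t] / len(targets)  # PR(t)/C(t) term
            for x in targets:
                nxt[x] += share
        pr = nxt
    return pr

# Tiny 3-node cycle; by symmetry each node converges to rank 1/3.
g = {"a": ["b"], "b": ["c"], "c": ["a"]}
pr = pagerank(g)
print(round(sum(pr.values()), 6))   # ranks sum to 1 -> 1.0
```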

Breadth First Search. Breadth First Search (BFS) is a fundamental graph operation. Given an undirected graph G(V, E) and a source node s, BFS computes, for every node v ∈ V, the shortest distance (i.e., the minimum number of hops) from s to v in G. Define: b is reachable from a if b is on the adjacency list of a. Then DistanceTo(s) = 0; for all nodes p reachable from s, DistanceTo(p) = 1; and for all nodes n reachable from some other set of nodes M, DistanceTo(n) = 1 + min_{m ∈ M} DistanceTo(m).

Breadth First Search. [Figure: graphical overview of the Breadth First Search algorithm.]
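The BFS recurrence can be sketched as round-by-round frontier expansion, where each round settles the nodes one hop further from s, mirroring one MapReduce iteration. A minimal sequential sketch (the example graph is made up):

```python
# Sketch of BFS by frontier expansion: round d assigns distance d to all
# unvisited neighbors of the previous frontier, implementing
# DistanceTo(n) = 1 + min over the nodes that reach n.

def bfs_rounds(adj, s):
    """adj: dict node -> adjacency list; returns dict node -> distance."""
    dist = {s: 0}
    frontier = [s]
    d = 0
    while frontier:
        d += 1
        nxt = []
        for u in frontier:
            for v in adj.get(u, []):
                if v not in dist:   # first time seen => shortest distance
                    dist[v] = d
                    nxt.append(v)
        frontier = nxt
    return dist

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(bfs_rounds(adj, 0))   # -> {0: 0, 1: 1, 2: 1, 3: 2}
```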

Graph Keyword Search. We now investigate a more complex algorithm, namely keyword search in an undirected graph G(V, E). Suppose that for each v ∈ V, t(v) is the text information included in v. Given a keyword query Q = {k_1, k_2, ..., k_l}, a set of l keywords, the answers are rooted trees of the form

(r, {(p_1, d(r, p_1)), (p_2, d(r, p_2)), ..., (p_l, d(r, p_l))})

where r is the root node, p_i is a node that contains keyword k_i in t(p_i), and d(r, p_i) is the shortest distance from r to p_i in G for 1 ≤ i ≤ l. Each answer is uniquely determined by its root node r, and rmax is the maximum distance allowed from r to a keyword node in an answer, i.e., d(r, p_i) ≤ rmax for 1 ≤ i ≤ l.

Connected Component. Given an undirected graph G(V, E) with n nodes and m edges, a Connected Component (CC) is a maximal set of nodes that can reach each other through paths in G. Computing all CCs of G is a fundamental graph problem and can be solved efficiently on a sequential machine in O(n + m) time. However, it is non-trivial to solve the problem in MapReduce.

Existing Algorithms. We present three existing algorithms for Connected Components computation in MapReduce against which to compare the CC algorithm in SGC: HashToMin, HashGToMin, and PRAM-Simulation.

HashToMin. HashToMin and HashGToMin are two MapReduce algorithms that share the idea of using the smallest node in each CC as the representative of the CC, assuming a total order among all nodes in G. The HashToMin algorithm finishes in O(log n) rounds, with O(log(n)(m + n)) total communication cost in each round. The algorithm can be optimized to use O(1) memory on each machine using secondary sort in MapReduce.
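The min-as-representative idea can be illustrated with a simplified min-label propagation in the same spirit. This is not the exact HashToMin algorithm, just a sketch of the underlying invariant: every node repeatedly adopts the smallest label seen in its neighborhood, so each CC converges to its minimum node id.

```python
# Sketch of min-label propagation for connected components: labels start
# as node ids and each round takes the minimum over the closed
# neighborhood, until a fixed point is reached.

def min_label_cc(adj):
    """adj: dict node -> adjacency list; returns dict node -> CC label."""
    label = {v: v for v in adj}
    changed = True
    while changed:
        changed = False
        for u in adj:
            best = min([label[u]] + [label[v] for v in adj[u]])
            if best < label[u]:
                label[u] = best
                changed = True
    return label

# Two components: {0, 1} and {2, 3, 4}.
adj = {0: [1], 1: [0], 2: [3], 3: [2, 4], 4: [3]}
print(min_label_cc(adj))   # -> {0: 0, 1: 0, 2: 2, 3: 2, 4: 2}
```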

HashGToMin. The HashGToMin algorithm finishes in Õ(log n) rounds, meaning it is expected to finish in O(log n) rounds, with O(m + n) total communication cost in each round. However, it needs O(n) memory on a single machine to hold a whole CC in memory. Thus, HashGToMin is not suitable for handling a graph with large n.

PRAM Simulation. PRAM-Simulation simulates an algorithm for the Parallel Random Access Machine (PRAM) model in MapReduce. The PRAM model allows multiple processors to compute in parallel using a shared memory. A theoretical result shows that a CREW PRAM algorithm running in O(t) time can be simulated in MapReduce in O(t) rounds. For the CC computation problem, the best result in the literature computes CCs in O(log n) time. However, it needs to compute the 2-hop node pairs, which requires O(n²) communication cost in the worst case in each round. Thus, the simulation algorithm is impractical.

Connected Component in SGC We introduce our algorithm to compute CCs in SGC. Conceptually, the algorithm shares ideas with most deterministic O(log n) PRAM algorithms, but adapting them is non-trivial. Our algorithm maintains a forest using a parent pointer p(v) for each v ∈ V. Each rooted tree in the forest represents a partial CC. A singleton is a tree with one node, and a star is a tree of height 1. A tree is an isolated tree if no edge in E connects it to another tree. The forest is iteratively updated using two operations: hooking and pointer jumping. Hooking merges several trees into a larger tree, and pointer jumping changes the parent of each node to its grandparent.
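The two operations can be sketched sequentially as follows (a simplification for intuition; the actual SGC algorithm implements them with graph join operations in MapReduce):

```python
def cc_hook_and_jump(nodes, edges):
    # Forest maintained via parent pointers p(v); each rooted tree
    # is a partial CC (singletons and stars are special cases).
    p = {v: v for v in nodes}

    def root(v):
        while p[v] != v:
            v = p[v]
        return v

    while True:
        # Hooking: for each edge crossing two trees, hook the tree
        # with the larger root onto the one with the smaller root.
        hooked = False
        for u, v in edges:
            ru, rv = root(u), root(v)
            if ru != rv:
                if ru < rv:
                    p[rv] = ru
                else:
                    p[ru] = rv
                hooked = True
        # Pointer jumping: every node's parent becomes its
        # grandparent, halving tree height toward a star.
        for v in nodes:
            p[v] = p[p[v]]
        if not hooked:
            break
    return {v: root(v) for v in nodes}
```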

Comparison We can now compare the running times of these algorithms. We omit PRAM-Simulation since it is impractical. Note that the CC algorithm in the SGC class has the best bounds in each category, which indicates the significant improvement that SGC represents for scalable big graph processing.

Minimum Spanning Forest Given a weighted undirected graph G(V, E) with n nodes and m edges, where each edge (u, v) ∈ E is assigned a weight w((u, v)), a Minimum Spanning Forest (MSF) is a spanning forest of G with the minimum total edge weight. We also use (u, v, w((u, v))) to denote an edge. Although an MSF can be computed efficiently on a sequential machine in O(m + n log n) time, solving the problem in MapReduce is non-trivial.
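The O(m + n log n) sequential bound is achieved by Prim's algorithm with a Fibonacci heap; for intuition, a shorter sequential sketch is Kruskal's algorithm with union-find (O(m log m)): scan edges in weight order and keep each edge that connects two previously separate components.

```python
def msf_kruskal(nodes, edges):
    # edges: list of (u, v, w) triples. Kruskal's algorithm keeps
    # an edge iff its endpoints lie in different components so far.
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    forest = []
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((u, v, w))
    return forest
```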

Minimum Spanning Forest The following is an example of a minimum spanning tree; a minimum spanning forest is made up of one such tree per connected component.

MSF Algorithm in SGC Suppose there is a total order among all edges, defined as follows. For any two edges e1 = (u1, v1, w1) and e2 = (u2, v2, w2), e1 < e2 iff one of the following conditions holds:
1. w1 < w2
2. w1 = w2 and min(u1, v1) < min(u2, v2)
3. w1 = w2, min(u1, v1) = min(u2, v2), and max(u1, v1) < max(u2, v2)
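In code, this total order is simply a lexicographic comparison of sort keys (`edge_key` is a hypothetical helper name; edges are assumed to be (u, v, w) triples):

```python
def edge_key(e):
    # Sort key realizing the total order: compare by weight first,
    # then by the smaller endpoint id, then by the larger one.
    u, v, w = e
    return (w, min(u, v), max(u, v))

# e1 < e2 in the total order  iff  edge_key(e1) < edge_key(e2)
```

Because ties are broken by endpoint ids, no two distinct edges compare equal, which is what makes the order total.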

MSF Comparisons The comparison of the two existing algorithms, OneRoundMSF and MultiRoundMSF, with our algorithm MSF is shown below in terms of memory consumption per machine, total communication cost per round, and the number of rounds. As our performance testing will show, the high memory requirement of OneRoundMSF and MultiRoundMSF becomes the bottleneck that prevents these algorithms from scaling to graphs with large n.

Performance Testing We tested the performance of the aforementioned algorithms on a cluster of 17 computing nodes: one master node and 16 slave nodes, each with four Intel Xeon 2.4GHz CPUs and 15GB RAM, running 64-bit Ubuntu Linux. We implemented all algorithms using Hadoop (version 1.2.1) with Java 1.6 and allowed each node to run three mappers and three reducers concurrently.

Data Sets We use two web-scale graphs, Twitter-2010 and Friendster, with different graph characteristics for testing. Twitter-2010 contains 41,652,230 nodes and 1,468,365,182 edges, with an average degree of 71, a maximum degree of 3,081,112, and a diameter of around 24. Friendster contains 65,608,366 nodes and 1,806,067,135 edges, with an average degree of 55, a maximum degree of 5,214, and a diameter of around 32.

Algorithms Besides the five algorithms PageRank (Algorithm 1), BFS (Algorithm 2), KWS (Algorithm 3), CC (Algorithm 4), and MSF (Algorithm 5), we also implement PageRank, BFS, and graph keyword search using the join operations supported by Pig on Hadoop, denoted PageRank-Pig, BFS-Pig, and KWS-Pig, respectively.

PageRank Algorithm
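As a rough sketch of the round structure (not the listing from the paper), one PageRank iteration in the MapReduce style can be written as follows, assuming a damping factor of 0.85 and a graph with no dangling nodes:

```python
from collections import defaultdict

def pagerank_round(adj, rank, d=0.85):
    # Map phase: each node sends rank[v] / out-degree to every
    # out-neighbor. (Assumes every node has at least one out-edge.)
    contrib = defaultdict(float)
    for v, outs in adj.items():
        for u in outs:
            contrib[u] += rank[v] / len(outs)
    # Reduce phase: each node sums its incoming contributions and
    # applies the damping term.
    n = len(adj)
    return {v: (1 - d) / n + d * contrib[v] for v in adj}
```

One MapReduce round corresponds to one call of this function; the driver iterates it until the ranks converge.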

BFS Algorithm
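As a sketch of what the BFS algorithm computes (sequentially here; in MapReduce each round expands one frontier, so the number of rounds is bounded by the graph diameter):

```python
from collections import deque

def bfs_levels(adj, src):
    # Level-synchronous BFS from src: dist[v] is the number of hops
    # from src to v; unreachable nodes are absent from the result.
    dist = {src: 0}
    q = deque([src])
    while q:
        v = q.popleft()
        for u in adj.get(v, []):
            if u not in dist:
                dist[u] = dist[v] + 1
                q.append(u)
    return dist
```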

CC Algorithm

MSF Algorithm

Conclusions In this paper, we studied scalable big graph processing in MapReduce. We reviewed previous MapReduce classes and proposed a new class, SGC, to guide the development of scalable graph processing algorithms in MapReduce. We introduced two graph join operators with which a wide range of graph algorithms can be designed in SGC. In particular, for the two fundamental problems of CC computation and MSF computation, we improved the state-of-the-art algorithms both in theory and in practice. We conducted extensive performance studies on real web-scale graphs to demonstrate the high scalability achieved by our algorithms in SGC.