Pregel: A System for Large-Scale Graph Processing
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski (Google, Inc.), SIGMOD 2010
Presented by Taewhi Lee
Outline
- Introduction
- Computation Model
- Writing a Pregel Program
- System Implementation
- Experiments
- Conclusion & Future Work
Introduction (1/2)
[figure omitted: MapReduce overview. Source: SIGMETRICS '09 tutorial "MapReduce: The Programming Model and Practice," by Jerry Zhao]
Introduction (2/2)
- Many practical computing problems concern large graphs
  - Large graph data: Web graph, transportation routes, citation relationships, social networks
  - Graph algorithms: PageRank, shortest path, connected components, clustering techniques
- MapReduce is ill-suited for graph processing
  - Many iterations are needed for parallel graph processing
  - Materialization of intermediate results at every MapReduce iteration harms performance
Single Source Shortest Path (SSSP)
- Problem: find the shortest path from a source node to all target nodes
- Solution: on a single-processor machine, Dijkstra's algorithm
Example: SSSP (Dijkstra's Algorithm)
[animation over six slides; the example graph has vertices A-E and directed weighted edges A→B=10, A→D=5, B→C=1, B→D=2, C→E=4, D→B=3, D→C=9, D→E=2, E→A=7, E→C=6. Starting from A=0, each frame settles the closest unsettled vertex and relaxes its out-edges:
- settle A=0: tentative B=10, D=5
- settle D=5: B=8, C=14, E=7
- settle E=7: C=13
- settle B=8: C=9
- settle C=9: done; final distances A=0, B=8, C=9, D=5, E=7]
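As a reference point for the parallel versions that follow, here is a compact, runnable Dijkstra over the example graph above; this code is illustrative and not from the slides.

```cpp
// Standard Dijkstra on the example graph; reproduces the settling order
// A, D, E, B, C and the final distances shown in the last frame.
#include <functional>
#include <iostream>
#include <limits>
#include <map>
#include <queue>
#include <vector>

int main() {
  std::map<char, std::vector<std::pair<char, int>>> adj = {
      {'A', {{'B', 10}, {'D', 5}}},
      {'B', {{'C', 1}, {'D', 2}}},
      {'C', {{'E', 4}}},
      {'D', {{'B', 3}, {'C', 9}, {'E', 2}}},
      {'E', {{'A', 7}, {'C', 6}}}};

  std::map<char, int> dist;
  for (const auto& [v, _] : adj) dist[v] = std::numeric_limits<int>::max();
  dist['A'] = 0;

  using Item = std::pair<int, char>;  // (tentative distance, vertex)
  std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
  pq.push({0, 'A'});
  while (!pq.empty()) {
    auto [d, u] = pq.top();
    pq.pop();
    if (d > dist[u]) continue;  // stale queue entry; u already settled
    for (const auto& [v, w] : adj[u])
      if (d + w < dist[v]) {
        dist[v] = d + w;
        pq.push({dist[v], v});
      }
  }
  for (const auto& [v, d] : dist)
    std::cout << v << ": " << d << "\n";  // A:0 B:8 C:9 D:5 E:7
}
```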
Single Source Shortest Path (SSSP)
- Problem: find the shortest path from a source node to all target nodes
- Solution
  - Single-processor machine: Dijkstra's algorithm
  - MapReduce/Pregel: parallel breadth-first search (BFS)
MapReduce Execution Overview
[figure omitted]
Example: SSSP (Parallel BFS in MapReduce)

Adjacency matrix (row = from, column = to):
       A    B    C    D    E
  A        10         5
  B              1    2
  C                        4
  D         3    9         2
  E    7         6

Adjacency list:
  A: (B, 10), (D, 5)
  B: (C, 1), (D, 2)
  C: (E, 4)
  D: (B, 3), (C, 9), (E, 2)
  E: (A, 7), (C, 6)
Example: SSSP (Parallel BFS in MapReduce)

Map input: <node ID, <dist, adj list>>
  <A, <0, <(B, 10), (D, 5)>>>
  <B, <inf, <(C, 1), (D, 2)>>>
  <C, <inf, <(E, 4)>>>
  <D, <inf, <(B, 3), (C, 9), (E, 2)>>>
  <E, <inf, <(A, 7), (C, 6)>>>

Map output: <dest node ID, dist> → Flushed to local disk!!
  from A: <B, 10> <D, 5>
  from B: <C, inf> <D, inf>
  from C: <E, inf>
  from D: <B, inf> <C, inf> <E, inf>
  from E: <A, inf> <C, inf>
  (each mapper also passes its <node ID, <dist, adj list>> record through)
Example: SSSP (Parallel BFS in MapReduce)

Reduce input: <node ID, dist>
  <A, <0, <(B, 10), (D, 5)>>>            <A, inf>
  <B, <inf, <(C, 1), (D, 2)>>>           <B, 10> <B, inf>
  <C, <inf, <(E, 4)>>>                   <C, inf> <C, inf> <C, inf>
  <D, <inf, <(B, 3), (C, 9), (E, 2)>>>   <D, 5> <D, inf>
  <E, <inf, <(A, 7), (C, 6)>>>           <E, inf> <E, inf>
Example: SSSP (Parallel BFS in MapReduce)

Reduce output: <node ID, <dist, adj list>> = Map input for next iteration → Flushed to DFS!!
  <A, <0, <(B, 10), (D, 5)>>>
  <B, <10, <(C, 1), (D, 2)>>>
  <C, <inf, <(E, 4)>>>
  <D, <5, <(B, 3), (C, 9), (E, 2)>>>
  <E, <inf, <(A, 7), (C, 6)>>>

Map output (next iteration): <dest node ID, dist> → Flushed to local disk!!
  from A: <B, 10> <D, 5>
  from B: <C, 11> <D, 12>
  from C: <E, inf>
  from D: <B, 8> <C, 14> <E, 7>
  from E: <A, inf> <C, inf>
  (plus the five <node ID, <dist, adj list>> records passed through)
Example: SSSP (Parallel BFS in MapReduce)

Reduce input: <node ID, dist>
  <A, <0, <(B, 10), (D, 5)>>>            <A, inf>
  <B, <10, <(C, 1), (D, 2)>>>            <B, 10> <B, 8>
  <C, <inf, <(E, 4)>>>                   <C, 11> <C, 14> <C, inf>
  <D, <5, <(B, 3), (C, 9), (E, 2)>>>     <D, 5> <D, 12>
  <E, <inf, <(A, 7), (C, 6)>>>           <E, inf> <E, 7>
Example: SSSP (Parallel BFS in MapReduce)

Reduce output: <node ID, <dist, adj list>> = Map input for next iteration → Flushed to DFS!!
  <A, <0, <(B, 10), (D, 5)>>>
  <B, <8, <(C, 1), (D, 2)>>>
  <C, <11, <(E, 4)>>>
  <D, <5, <(B, 3), (C, 9), (E, 2)>>>
  <E, <7, <(A, 7), (C, 6)>>>
  ... the rest omitted
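To make the chained-MapReduce pattern concrete, here is a minimal single-process sketch of one map/reduce iteration over the records above. The names NodeRecord, map_sssp, and reduce_sssp are illustrative, not from the paper, and the adjacency list is kept in `g` rather than passed through the shuffle, for brevity.

```cpp
// One map/reduce iteration of parallel-BFS SSSP over
// <node ID, <dist, adj list>> records.
#include <algorithm>
#include <climits>
#include <map>
#include <utility>
#include <vector>

const int INF = INT_MAX;

struct NodeRecord {
  int dist;                               // current tentative distance
  std::vector<std::pair<char, int>> adj;  // (neighbor ID, edge weight)
};
using Graph = std::map<char, NodeRecord>;

// Map: every node emits <dest, dist + w> for each out-edge.
void map_sssp(const Graph& g, std::multimap<char, int>& shuffle) {
  for (const auto& [id, rec] : g)
    for (const auto& [dst, w] : rec.adj)
      shuffle.emplace(dst, rec.dist == INF ? INF : rec.dist + w);
}

// Reduce: every node keeps the minimum of its old distance and all
// candidate distances addressed to it.
void reduce_sssp(Graph& g, const std::multimap<char, int>& shuffle) {
  for (auto& [id, rec] : g) {
    auto [lo, hi] = shuffle.equal_range(id);
    for (auto it = lo; it != hi; ++it)
      rec.dist = std::min(rec.dist, it->second);
  }
}
```

Driving this to a fixed point takes a loop outside MapReduce: on the example graph the distances stop changing after four iterations (A=0, B=8, C=9, D=5, E=7), and between iterations the whole graph is written out and reread, which is the "flushed to disk" overhead shown above.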
Computation Model (1/3)
Input → Supersteps (a sequence of iterations) → Output
Computation Model (2/3)
- "Think like a vertex"
- Inspired by Valiant's Bulk Synchronous Parallel model (1990)
- Source: http://en.wikipedia.org/wiki/bulk_synchronous_parallel
Computation Model (3/3)
- Superstep: the vertices compute in parallel
- Each vertex:
  - Receives messages sent in the previous superstep
  - Executes the same user-defined function
  - Modifies its value or that of its outgoing edges
  - Sends messages to other vertices (to be received in the next superstep)
  - Mutates the topology of the graph
  - Votes to halt if it has no further work to do
- Termination condition: all vertices are simultaneously inactive and there are no messages in transit (see the toy sketch below)
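To make these rules concrete, here is a toy, single-process sketch (all names illustrative, not the Pregel API) that runs supersteps until every vertex has voted to halt and no messages are in transit. The per-vertex function propagates the maximum value through the graph, the introductory example used in the Pregel paper.

```cpp
// Toy superstep loop: vertices conceptually compute in parallel, messages
// sent in superstep S are delivered in superstep S+1, and execution stops
// when all vertices are inactive and no messages are in transit.
#include <algorithm>
#include <iostream>
#include <map>
#include <vector>

struct Context {
  std::map<int, std::vector<int>> outbox;  // messages for the next superstep
  void send(int dest, int value) { outbox[dest].push_back(value); }
};

struct ToyVertex {
  int id, value;
  std::vector<int> neighbors;
  bool active = true;
  // The user-defined function: adopt the largest value seen, gossip changes.
  void compute(const std::vector<int>& msgs, Context& ctx) {
    int old = value;
    for (int m : msgs) value = std::max(value, m);
    if (msgs.empty() || value > old)  // first superstep, or value improved
      for (int n : neighbors) ctx.send(n, value);
    active = false;  // vote to halt; a future message reactivates us
  }
};

int main() {
  std::vector<ToyVertex> vs = {{0, 3, {1}}, {1, 6, {0, 2}}, {2, 2, {1}}};
  std::map<int, std::vector<int>> inbox;
  bool any_active = true;
  while (any_active || !inbox.empty()) {  // the termination condition above
    Context ctx;
    for (auto& v : vs) {
      std::vector<int> msgs = inbox[v.id];
      if (!msgs.empty()) v.active = true;  // incoming message reactivates
      if (v.active) v.compute(msgs, ctx);
    }
    inbox = std::move(ctx.outbox);
    any_active = false;
    for (const auto& v : vs) any_active = any_active || v.active;
  }
  for (const auto& v : vs)
    std::cout << v.id << " = " << v.value << "\n";  // every vertex prints 6
}
```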
Example: SSSP (Parallel BFS in Pregel)
[animation over nine slides on the same example graph, one frame per batch of messages:
- Superstep 0: A=0 sends distance 10 to B and 5 to D; every other vertex holds inf
- Superstep 1: B and D update to 10 and 5 and relax their out-edges: B sends 11 to C and 12 to D; D sends 8 to B, 14 to C, and 7 to E
- Superstep 2: B=8, C=11, E=7; the changed vertices send again: B sends 9 to C and 10 to D; C sends 15 to E; E sends 13 to C and 14 to A
- Superstep 3: C=9; C sends 13 to E
- Superstep 4: no message improves any distance, so every vertex votes to halt; final distances A=0, B=8, C=9, D=5, E=7]
Differences from MapReduce
- Graph algorithms can be written as a series of chained MapReduce invocations, but:
- Pregel
  - Keeps vertices & edges on the machine that performs computation
  - Uses network transfers only for messages
- MapReduce
  - Passes the entire state of the graph from one stage to the next
  - Needs to coordinate the steps of a chained MapReduce
C++ API
- Writing a Pregel program = subclassing the predefined Vertex class
- Override the Compute() method ("Override this!" on the slide), which reads the incoming messages ("in msgs") and sends outgoing messages ("out msg"); see the sketch below
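The code box on this slide is the paper's generic Vertex base class, reproduced approximately below. string, int64, MessageIterator, and OutEdgeIterator are Pregel-internal types assumed declared elsewhere, so this is a sketch of the API shape rather than compilable code.

```cpp
// The Vertex base class from the Pregel paper (reproduced approximately).
// Slide callouts: "Override this!" marks Compute(); "in msgs" is the
// MessageIterator argument; "out msg" is the message given to SendMessageTo().
template <typename VertexValue,
          typename EdgeValue,
          typename MessageValue>
class Vertex {
 public:
  virtual void Compute(MessageIterator* msgs) = 0;  // override this

  const string& vertex_id() const;
  int64 superstep() const;

  const VertexValue& GetValue();
  VertexValue* MutableValue();
  OutEdgeIterator GetOutEdgeIterator();

  void SendMessageTo(const string& dest_vertex,
                     const MessageValue& message);
  void VoteToHalt();
};
```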
Example: Vertex Class for SSSP
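The slide shows the paper's shortest-path vertex; it is reproduced approximately below. IsSource() and INF are helpers from the paper's example, assumed defined elsewhere.

```cpp
// SSSP Compute() from the Pregel paper (reproduced approximately): take the
// minimum of all incoming candidate distances; if it improves this vertex's
// value, relax every out-edge; then vote to halt either way.
class ShortestPathVertex
    : public Vertex<int, int, int> {
  void Compute(MessageIterator* msgs) {
    int mindist = IsSource(vertex_id()) ? 0 : INF;
    for (; !msgs->Done(); msgs->Next())
      mindist = min(mindist, msgs->Value());
    if (mindist < GetValue()) {
      *MutableValue() = mindist;
      OutEdgeIterator iter = GetOutEdgeIterator();
      for (; !iter.Done(); iter.Next())
        SendMessageTo(iter.Target(),
                      mindist + iter.GetValue());
    }
    VoteToHalt();
  }
};
```

A vertex halts after every superstep and is reactivated only by an incoming message, so the computation ends exactly when no message can lower any distance, matching the termination condition of the computation model.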
System Architecture
- The Pregel system also uses the master/worker model
- Master
  - Maintains the set of workers
  - Recovers from worker faults
  - Provides a Web-UI monitoring tool for job progress
- Worker
  - Processes its task
  - Communicates with the other workers
- Persistent data is stored as files on a distributed storage system (such as GFS or BigTable)
- Temporary data is stored on local disk
Execution of a Pregel Program
1. Many copies of the program begin executing on a cluster of machines
2. The master assigns a partition of the input to each worker
   - Each worker loads its vertices and marks them as active
3. The master instructs each worker to perform a superstep
   - Each worker loops through its active vertices and computes for each vertex
   - Messages are sent asynchronously, but are delivered before the end of the superstep
   - This step is repeated as long as any vertices are active or any messages are in transit
4. After the computation halts, the master may instruct each worker to save its portion of the graph
Fault Tolerance
- Checkpointing
  - The master periodically instructs the workers to save the state of their partitions to persistent storage (e.g., vertex values, edge values, incoming messages)
- Failure detection
  - Using regular "ping" messages
- Recovery
  - The master reassigns graph partitions to the currently available workers
  - The workers all reload their partition state from the most recent available checkpoint
Experiments
- Environment
  - H/W: a cluster of 300 multicore commodity PCs
  - Data: binary trees, log-normal random graphs (general graphs)
- Naïve SSSP implementation
  - The weight of all edges = 1
  - No checkpointing
Experiments (SSSP): 1-billion-vertex binary tree, varying the number of worker tasks
Experiments (SSSP): binary trees, varying graph sizes on 800 worker tasks
Experiments (SSSP): random graphs, varying graph sizes on 800 worker tasks
Conclusion & Future Work
- Pregel is a scalable and fault-tolerant platform with an API that is sufficiently flexible to express arbitrary graph algorithms
- Future work
  - Relaxing the synchronicity of the model, so as not to wait for slower workers at inter-superstep barriers
  - Assigning vertices to machines to minimize inter-machine communication
  - Handling dense graphs, in which most vertices send messages to most other vertices
Thank You! Any questions or comments?