PREGEL: A SYSTEM FOR LARGE-SCALE GRAPH PROCESSING
Grzegorz Malewicz, Matthew Austern, Aart Bik, James Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski (Google, Inc.)
SIGMOD 2010
Presented by: Xiu Zhang, 2016-10-11
Motivation
Computation Model
System Architecture
Fault Tolerance
Applications
Experiments
MOTIVATION
Motivation
Computation over large graphs is needed in many domains:
- Social media
- Transportation
Motivation
The Web graph: documents = vertices, links = edges
Graph Algorithms
- Pattern matching: search through the entire graph; identify similar components
- Traversals: define a specific start point; iteratively explore the graph
- Global measurements: compute one value for the graph, based on all its vertices or edges
Challenges for Graph Algorithms
- Poor locality of memory access
- Very little computation per vertex, but many iterations (e.g., shortest path)
- Changing degree of parallelism over the course of execution (e.g., connected component analysis)
Possible Solutions
- A custom distributed framework for each algorithm: substantial implementation effort
- Existing distributed computing platforms such as MapReduce: unnecessarily slow, hard to express graph algorithms
- Single-computer graph algorithm libraries: limited scale
- Existing parallel graph systems (e.g., Parallel BGL and CGMgraph): no fault tolerance
Inspired by Valiant's Bulk Synchronous Parallel (BSP) model
Vertex-centric computation
COMPUTATION MODEL
Computation Model (BSP)
[Figure: a BSP superstep consists of local computation, communication, and barrier synchronization]
Source: http://en.wikipedia.org/wiki/bulk_synchronous_parallel
Pregel: Message Passing Model
Vertex:
- A unique identifier
- A modifiable, user-defined value
Edge:
- Source and target vertex identifiers
- A modifiable, user-defined value
Basic Organization
Supersteps (iterations): a user-defined function is invoked for each vertex V in superstep S; it can
- Read messages sent to V in superstep S-1
- Send messages that will be received in superstep S+1
- Modify the state of V and of V's outgoing edges
- Make topology changes: introduce/delete/modify edges and vertices
- Vote to halt if there is no further work to do
State machine for a vertex
[Figure: active/inactive vertex states; an incoming message reactivates an inactive vertex]
Termination condition:
- All vertices are simultaneously inactive
- There are no messages in transit
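The paper expresses this contract as a C++ template class. Below is an abridged sketch of that API (declarations only, iterator types simplified) showing the calls a Compute() implementation relies on.

// Abridged sketch of Pregel's C++ vertex API, adapted from the paper;
// declarations only, with iterator details simplified.
template <typename VertexValue, typename EdgeValue, typename MessageValue>
class Vertex {
 public:
  // Invoked once per active vertex in every superstep S. Reads messages
  // sent in superstep S-1; messages sent here arrive in superstep S+1.
  virtual void Compute(MessageIterator* msgs) = 0;

  const string& vertex_id() const;       // unique identifier
  int64 superstep() const;               // current superstep number
  const VertexValue& GetValue();         // read the vertex value
  VertexValue* MutableValue();           // modify the vertex value
  OutEdgeIterator GetOutEdgeIterator();  // iterate over outgoing edges

  void SendMessageTo(const string& dest_vertex, const MessageValue& message);
  void VoteToHalt();                     // deactivate until a message arrives
};

A user only subclasses Vertex and overrides Compute(); message delivery and superstep synchronization are handled by the framework.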
Example: Single-Source Shortest Path (SSSP)
Find the shortest path from a source node to all other nodes
Example taken from a talk by Taewhi Lee, 2010: http://zhenxiao.com/read/.ppt
Example: SSSP Parallel BFS in Pregel
[Figure sequence, one diagram per superstep; legend: inactive vertex, active vertex, edge weight, message. The source starts at distance 0 and all other vertices at infinity. In each superstep, every vertex that receives messages takes their minimum, updates its value if that is an improvement, sends (its value + edge weight) along each out-edge, and votes to halt; incoming messages reactivate vertices, and the run ends when all vertices are inactive with no messages in transit.]
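Each superstep in this walkthrough corresponds to one Compute() call per active vertex. The sketch below is adapted from the paper's single-source shortest-paths example; INF (a value larger than any path length) and IsSource() are assumed helpers.

// Adapted from the paper's shortest-paths example. Vertex<int, int, int>
// means the vertex value, edge value, and message value are all distances.
class ShortestPathVertex : public Vertex<int, int, int> {
  void Compute(MessageIterator* msgs) {
    // Start from INF everywhere except the source, then fold in candidates.
    int mindist = IsSource(vertex_id()) ? 0 : INF;
    for (; !msgs->Done(); msgs->Next())
      mindist = min(mindist, msgs->Value());
    if (mindist < GetValue()) {
      *MutableValue() = mindist;
      // Relax all out-edges: each neighbor learns a candidate distance.
      OutEdgeIterator iter = GetOutEdgeIterator();
      for (; !iter.Done(); iter.Next())
        SendMessageTo(iter.Target(), mindist + iter.GetValue());
    }
    VoteToHalt();  // deactivate; a new message will reactivate this vertex
  }
};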
SYSTEM ARCHITECTURE
System Architecture
Pregel uses the master/worker model
Master:
- Coordinates the workers
- Recovers workers from faults
Worker:
- Processes its task
- Communicates with the other workers
Persistent data lives in a distributed storage system; temporary data is stored on local disk
Execution
[Figure sequence: step-by-step execution of a Pregel program, shown over five slides]
FAULT TOLERANCE
Fault Tolerance
Checkpointing:
- The master periodically instructs the workers to save the state of their partitions to persistent storage
- e.g., vertex values, edge values, incoming messages
Failure detection:
- The master uses regular "ping" messages to detect worker failures
Fault Tolerance
Recovery:
- The master reassigns graph partitions to the currently available workers
- All workers reload their partition state from the most recent available checkpoint
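A self-contained toy of the control flow these two slides describe, checkpointing every few supersteps and rolling back on a detected failure; all names and numbers are invented for illustration, this is not the paper's code.

#include <cstdio>

// Hypothetical toy of Pregel-style checkpoint/recovery control flow;
// the interval, failure point, and bound below are illustrative only.
int main() {
  const int kCheckpointInterval = 3;
  int superstep = 0, checkpointed_superstep = 0;
  bool failure_injected = false;
  while (superstep < 8) {
    if (superstep % kCheckpointInterval == 0)
      checkpointed_superstep = superstep;  // workers save partition state
    // ... one superstep of computation would run here ...
    superstep++;
    if (superstep == 5 && !failure_injected) {
      failure_injected = true;  // a ping to some worker times out
      // Recovery: reassign that worker's partitions, then all workers
      // reload the most recent checkpoint and recompute from there.
      std::printf("failure at superstep %d, rolling back to %d\n",
                  superstep, checkpointed_superstep);
      superstep = checkpointed_superstep;
    }
  }
  std::printf("finished\n");
  return 0;
}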
APPLICATIONS
PageRank
The importance of a document depends on the number of references to it and on the importance of the source documents themselves.
A = a given page
T_1 ... T_n = pages that point to page A (citations)
d = damping factor between 0 and 1 (usually kept at 0.85)
C(T) = number of links going out of T
PR(A) = the PageRank of page A

PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \frac{PR(T_2)}{C(T_2)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)
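A quick worked instance of the formula, with illustrative numbers (not from the slides): suppose A is cited by T_1 with PR(T_1) = 1, C(T_1) = 2, and by T_2 with PR(T_2) = 1, C(T_2) = 1, and d = 0.85. Then

PR(A) = (1 - 0.85) + 0.85 \left( \frac{1}{2} + \frac{1}{1} \right) = 0.15 + 1.275 = 1.425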
PageRank
[Figure: PageRank illustration, courtesy of Wikipedia]
PageRank: iterative loop till convergence

initialize PageRank of all pages to 1.0;
while (|sum of PageRank of all pages - numPages| > epsilon) {
  for each page Pi in list {
    PageRank(Pi) = (1 - d);
    for each page Pj linking to page Pi {
      PageRank(Pi) += d * (PageRank(Pj) / numOutLinks(Pj));
    }
  }
}
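A minimal runnable C++ version of this loop on a tiny hard-coded three-page graph; the graph, epsilon, and the in-link representation are illustrative choices, not from the slides.

#include <cmath>
#include <cstdio>
#include <vector>

// Minimal serial PageRank matching the slide's loop; the three-page graph
// below is an illustrative choice (every page has at least one out-link).
int main() {
  const double d = 0.85, epsilon = 1e-6;
  // in_links[i] lists the pages that link to page i;
  // out_degree[j] is the number of links going out of page j.
  std::vector<std::vector<int>> in_links = {{1, 2}, {0}, {0, 1}};
  std::vector<int> out_degree = {2, 2, 1};
  const int n = 3;
  std::vector<double> pr(n, 1.0);  // initial PageRank of all pages = 1.0

  // The sum starts exactly at numPages, so use do-while to run at least once.
  double sum;
  do {
    for (int i = 0; i < n; ++i) {
      pr[i] = (1.0 - d);
      for (int j : in_links[i])
        pr[i] += d * (pr[j] / out_degree[j]);
    }
    sum = 0;
    for (double v : pr) sum += v;
  } while (std::fabs(sum - n) > epsilon);  // total rank converges to numPages

  for (int i = 0; i < n; ++i) std::printf("PR(%d) = %f\n", i, pr[i]);
  return 0;
}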
PageRank in Pregel
[Figure: the paper's C++ PageRank vertex implementation]
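The vertex program is only a few lines; the sketch below is adapted from the paper's PageRank example (the fixed 30-superstep cutoff is the paper's simplification; a real run would test convergence instead).

// Adapted from the paper's PageRank example. Vertex<double, void, double>
// means a double vertex value, no edge value, and double messages.
class PageRankVertex : public Vertex<double, void, double> {
 public:
  virtual void Compute(MessageIterator* msgs) {
    if (superstep() >= 1) {
      // Sum the rank contributions sent by in-neighbors last superstep.
      double sum = 0;
      for (; !msgs->Done(); msgs->Next())
        sum += msgs->Value();
      *MutableValue() = 0.15 / NumVertices() + 0.85 * sum;
    }
    if (superstep() < 30) {
      // Spread this vertex's rank evenly over its out-edges.
      const int64 n = GetOutEdgeIterator().size();
      SendMessageToAllNeighbors(GetValue() / n);
    } else {
      VoteToHalt();  // run a fixed 30 supersteps, then stop
    }
  }
};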
EXPERIMENTS
Experiments (Shortest Paths)
[Chart: 1-billion-vertex binary tree, varying number of worker tasks]
Experiments
[Chart: binary trees, varying graph sizes on 800 worker tasks]
Experiments
[Chart: log-normal random graphs, mean out-degree 127.1 (thus over 127 billion edges in the largest case), varying graph sizes on 800 worker tasks]
Conclusion
- A distributed system for large-scale graph processing
- "Think like a vertex" computation model (intuitive API)
Limitations
- Inefficient if different regions of the graph converge at different speeds
- Each superstep waits for the slowest machine
- Dense graphs generate heavy message traffic
THANK YOU ANY QUESTIONS?