Apache Giraph: Facebook-scale graph processing infrastructure 3/31/2014 Avery Ching, Facebook GDM
Motivation
Apache Giraph
Inspired by Google's Pregel, but runs on Hadoop
"Think like a vertex"
Maximum value vertex example: [Figure: values propagate between vertices on Processor 1 and Processor 2 over supersteps until every vertex holds the maximum value, 5]
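As a minimal sketch of this model (assuming Giraph's BasicComputation API; the type choices are illustrative), the maximum value example might look like:

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class MaxValueComputation extends BasicComputation<
    LongWritable, LongWritable, NullWritable, LongWritable> {
  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) {
    boolean changed = getSuperstep() == 0;  // everyone sends in superstep 0
    for (LongWritable message : messages) {
      if (message.get() > vertex.getValue().get()) {
        vertex.getValue().set(message.get());
        changed = true;
      }
    }
    if (changed) {
      // Forward the new maximum to all neighbors.
      sendMessageToAllEdges(vertex, vertex.getValue());
    }
    // Halt; an incoming message re-activates the vertex next superstep.
    vertex.voteToHalt();
  }
}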
Giraph on Hadoop / YARN
Giraph runs on both MapReduce and YARN, across Hadoop versions 0.20.x, 0.20.203, 1.x, and 2.0.x
Page rank in Giraph
Calculate updated page rank value from neighbors
Send page rank value to neighbors for 30 iterations

public class PageRankComputation extends BasicComputation<
    LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) {
    if (getSuperstep() >= 1) {
      double sum = 0;
      for (DoubleWritable message : messages) {
        sum += message.get();
      }
      vertex.getValue().set((0.15d / getTotalNumVertices()) + 0.85d * sum);
    }
    if (getSuperstep() < 30) {
      sendMessageToAllEdges(vertex,
          new DoubleWritable(vertex.getValue().get() / vertex.getNumEdges()));
    } else {
      vertex.voteToHalt();
    }
  }
}
Apache Giraph data flow
1. Loading the graph: the master assigns input format splits (Split 0-3); workers load and send graph data, building an in-memory graph of partitions (Part 0-3)
2. Compute / Iterate: workers compute and send messages for their partitions, then send stats to the master and iterate
3. Storing the graph: workers write their partitions through the output format
Use case: k-means clustering
Cluster input vectors into k clusters
Assign each input vector to the closest centroid
Update centroid locations based on assignments
[Figure: random centroid locations c0-c2, assignment of points to centroids, centroid updates, repeated]
k-means in Giraph: partitioning the problem
Input vectors -> vertices, partitioned across machines
Centroids -> aggregators, shared data across all machines (see the sketch below)
Problem solved... right?
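A hedged sketch of the centroids-as-aggregators pattern, deliberately simplified to 1-D input vectors; the aggregator names ("centroid-i", "sum-i", "count-i"), K, and the master-side re-centering are assumptions, not the production code:

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class KMeans1D extends BasicComputation<
    LongWritable, DoubleWritable, NullWritable, NullWritable> {
  private static final int K = 3;  // number of clusters (illustrative)

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
      Iterable<NullWritable> messages) {
    // Read the shared centroid positions from aggregators.
    int closest = 0;
    double best = Double.MAX_VALUE;
    for (int c = 0; c < K; c++) {
      DoubleWritable centroid = getAggregatedValue("centroid-" + c);
      double d = Math.abs(vertex.getValue().get() - centroid.get());
      if (d < best) { best = d; closest = c; }
    }
    // Sum and count aggregators let the master recompute each centroid
    // (sum / count) between supersteps.
    aggregate("sum-" + closest, new DoubleWritable(vertex.getValue().get()));
    aggregate("count-" + closest, new LongWritable(1L));
  }
}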
Problem 1: Massive dimensions
Cluster Facebook members by friendships? 1 billion members (dimensions), k clusters
Each centroid is a 1B-dimension vector at 2 bytes per dimension (enough, given a max of 5K friends), so each worker sends the master up to 1B * 2 bytes * k = 2k GB
The master receives up to 2k * (number of workers) GB
Result: saturated network link, OOM
Sharded aggregators
Before: the master handles all aggregators; every worker sends all of its partial aggregates to the master, which computes every final aggregate
After: aggregators are sharded across workers; each worker computes the final aggregates for its shard from the other workers' partials, then distributes them
Shares the aggregator load across workers
Future work: tree-based optimizations (not yet a problem)
Problem 2: Edge cut metric
Clusters should reduce the number of cut edges
Two phases (see the sketch below):
1. Send your cluster id along all out-edges
2. Aggregate the count of incoming ids that differ from your own cluster id
Calculate no more than once an hour?
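A minimal sketch of the two phases, assuming cluster ids are stored as the vertex value and a LongSumAggregator named "cut-edges" is registered by the master (both assumptions):

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

// Phase 1: every vertex sends its cluster id to all neighbors.
public class SendClusterId extends BasicComputation<
    LongWritable, LongWritable, NullWritable, LongWritable> {
  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) {
    sendMessageToAllEdges(vertex, vertex.getValue());
  }
}

// Phase 2: count incoming ids that differ from our own cluster id.
// (Each undirected cut edge is counted once from each endpoint.)
public class CountCutEdges extends BasicComputation<
    LongWritable, LongWritable, NullWritable, LongWritable> {
  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) {
    long cut = 0;
    for (LongWritable neighborCluster : messages) {
      if (neighborCluster.get() != vertex.getValue().get()) {
        cut++;
      }
    }
    aggregate("cut-edges", new LongWritable(cut));
  }
}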
Master computation
Serial computation on master
Communicates to workers via aggregators
Added to Giraph by the Stanford GPS team
[Figure: timeline of the master interleaving k-means supersteps with start-cut / end-cut phases across Worker 0 and Worker 1]
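A minimal sketch of the pattern, with illustrative names: the master registers the aggregator once, then reads the workers' aggregated value serially between supersteps.

import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.LongWritable;

public class EdgeCutMaster extends DefaultMasterCompute {
  @Override
  public void initialize() throws InstantiationException, IllegalAccessException {
    // Workers aggregate into this; the master reads it back.
    registerAggregator("cut-edges", LongSumAggregator.class);
  }

  @Override
  public void compute() {
    // Runs serially on the master before each superstep.
    LongWritable cut = getAggregatedValue("cut-edges");
    if (getSuperstep() == 2) {
      // Both edge-cut phases have run; the metric is available here.
      System.out.println("Cut edges: " + cut.get());
      haltComputation();
    }
  }
}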
Problem 3: More phases, more problems
Add a stage to initialize the centroids: add random input vectors to centroids, plus a few random friends
Two phases:
1. Randomly sample input vertices to add
2. Send messages to a few random neighbors
Problem 3 (continued)
Cannot easily support different messages, combiners
Vertex compute code getting messy:

if (phase == INITIALIZE_SELF) {
  // Randomly add to centroid
} else if (phase == INITIALIZE_FRIEND) {
  // Add my vector to centroid if a friend selected me
} else if (phase == K_MEANS) {
  // Do k-means
} else if (phase == START_EDGE_CUT) {
  ...
}
Composable computation
Decouple vertex from computation
Master sets the computation, combiner classes (see the sketch below)
Reusable and composable

Computation                           | In message       | Out message      | Combiner
Add random centroid / random friends  | Null             | Centroid message | N/A
Add to centroid                       | Centroid message | Null             | N/A
K-means                               | Null             | Null             | N/A
Start edge cut                        | Null             | Cluster          | Cluster combiner
End edge cut                          | Cluster          | Null             | N/A
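A hedged sketch of composition in the master, reusing the edge-cut sketches above; KMeansComputation and ClusterIdCombiner are illustrative names, and setMessageCombiner assumes the Giraph 1.1-era MasterCompute API:

import org.apache.giraph.master.DefaultMasterCompute;

public class ClusteringMaster extends DefaultMasterCompute {
  @Override
  public void compute() {
    long step = getSuperstep();
    if (step < 30) {
      setComputation(KMeansComputation.class);      // k-means phase
    } else if (step == 30) {
      setComputation(SendClusterId.class);          // start edge cut
      setMessageCombiner(ClusterIdCombiner.class);  // per-phase combiner
    } else {
      setComputation(CountCutEdges.class);          // end edge cut
    }
  }
}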
Composable computation (cont)
Balanced Label Propagation:
1. Compute candidates to move to partitions
2. Probabilistically move vertices
3. Continue if halting condition not met (e.g. fewer than n vertices moved?)
Affinity Propagation:
1. Calculate and send responsibilities
2. Calculate and send availabilities
3. Update exemplars
4. Continue if halting condition not met (e.g. fewer than n vertices changed exemplars?)
Faster than Hive?

Application                  | Graph size  | CPU time speedup | Elapsed time speedup
Page rank (single iteration) | 400B+ edges | 26x              | 120x
Friends of friends score     | 71B+ edges  | 12.5x            | 48x
Apache Giraph scalability
[Two plots: scalability of workers (200B edges), seconds vs. 50-300 workers; and scalability of edges (50 workers), seconds vs. 1E+09-2E+11 edges. Giraph closely tracks the ideal curve in both.]
Page rank on 200 machines with 1 trillion (1,000,000,000,000) edges <4 minutes / iteration! * Results from 6/30/2013 with one-to-all messaging + request processing improvements
Why balanced partitioning
Random partitioning == good balance BUT ignores entity affinity
[Figure: graph of vertices 0-11 partitioned randomly, cutting many edges]
Balanced partitioning application
Results from one service: cache hit rate grew from 70% to 85%, bandwidth cut in half
[Figure: the same vertices 0-11 partitioned by affinity, keeping related vertices together]
Balanced label propagation results * Loosely based on Ugander and Backstrom. Balanced label propagation for partitioning massive graphs, WSDM '13
Avoiding out-of-core
Example: mutual friends calculation between neighbors
1. Send your friends a list of your friends
2. Intersect each received list with your own friend list
[Figure: five-vertex example graph showing the friend lists exchanged and their intersections]
The message volume is huge: 1.23B members (as of 1/2014) * 200+ average friends (2011 S1) * 200+ recipients * 8-byte ids (longs) = ~394 TB of messages; at 100 GB per machine, that is ~3,940 machines (not including the graph), as sketched below
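A hedged sketch of the two supersteps, assuming a LongIdList message type defined here for illustration; a production version would also carry the sender's id so mutual counts can be attributed:

import java.util.HashSet;
import java.util.Set;
import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;

public class MutualFriends extends BasicComputation<
    LongWritable, LongWritable, NullWritable, MutualFriends.LongIdList> {
  // Writable list of friend ids (assumed helper type).
  public static class LongIdList extends ArrayWritable {
    public LongIdList() { super(LongWritable.class); }
    public LongIdList(Writable[] ids) { super(LongWritable.class, ids); }
  }

  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongIdList> messages) {
    if (getSuperstep() == 0) {
      // 1. Send your friends a list of your friends.
      Writable[] friends = new Writable[vertex.getNumEdges()];
      int i = 0;
      for (Edge<LongWritable, NullWritable> edge : vertex.getEdges()) {
        friends[i++] = new LongWritable(edge.getTargetVertexId().get());
      }
      sendMessageToAllEdges(vertex, new LongIdList(friends));
    } else {
      // 2. Intersect each received list with your own friend list.
      Set<Long> myFriends = new HashSet<Long>();
      for (Edge<LongWritable, NullWritable> edge : vertex.getEdges()) {
        myFriends.add(edge.getTargetVertexId().get());
      }
      for (LongIdList message : messages) {
        int mutual = 0;
        for (Writable id : message.get()) {
          if (myFriends.contains(((LongWritable) id).get())) {
            mutual++;
          }
        }
        // mutual = number of friends shared with that sender.
      }
      vertex.voteToHalt();
    }
  }
}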
Superstep splitting
Process subsets of source/destination edges per superstep; a two-way split of vertices into halves A and B yields four sub-supersteps:
1. Sources: A (on), B (off); Destinations: A (on), B (off)
2. Sources: A (on), B (off); Destinations: A (off), B (on)
3. Sources: A (off), B (on); Destinations: A (on), B (off)
4. Sources: A (off), B (on); Destinations: A (off), B (on)
* Currently manual - future work: automatic! (a sketch follows)
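A hedged sketch of manual superstep splitting with a two-way hash split; the splitting policy and message payload are illustrative:

import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class SplitSuperstepComputation extends BasicComputation<
    LongWritable, LongWritable, NullWritable, LongWritable> {
  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) {
    long sub = getSuperstep() % 4;  // four sub-supersteps per logical step
    long sourceHalf = sub / 2;      // which half of sources is "on"
    long destHalf = sub % 2;        // which half of destinations is "on"
    if (vertex.getId().get() % 2 == sourceHalf) {
      for (Edge<LongWritable, NullWritable> edge : vertex.getEdges()) {
        if (edge.getTargetVertexId().get() % 2 == destHalf) {
          // Only about a quarter of all messages are in flight at once.
          sendMessage(edge.getTargetVertexId(), vertex.getValue());
        }
      }
    }
  }
}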
Debugging with GiraphicJam
Giraph in Production
Over 1.5 years in production
Over 100 jobs processed a week
30+ applications in our internal application repository
Sample production job: 700B+ edges
Future work
Investigate alternative computing models: Giraph++ (IBM Research), Giraphx (University at Buffalo, SUNY)
Performance
Lower the barrier to entry
Applications
Our team: Pavan Athivarapu, Avery Ching, Maja Kabiljo, Greg Malewicz, Sambavi Muthukrishnan