Efficient Aggregation for Graph Summarization Yuanyuan Tian (University of Michigan) Richard A. Hankins (Nokia Research Center) Jignesh M. Patel (University of Michigan)
Motivation Graphs are everywhere Social networks, biological networks Graph datasets growing rapidly in size. 5,, 4,, 3,, Number of Facebook Users 2,, 1,, DB Coauthorship Jun-4 Sep-4 Dec-4 Mar-5 Jun-5 Sep-5 Dec-5 Mar-6 Jun-6 Sep-6 Dec-6 Mar-7 Jun-7 Sep-7 Dec-7 7,445 nodes, 19,971 edges Impossible to understand by mere visual inspection. Need: Graph Summarization
Existing Methods Statistical methods Limited information, hard to interpret & manipulate. Frequent subgraph mining methods Produce a large number of results. Graph partitioning methods Largely ignore node attributes. Graph compression Compact storage. Graph visualization Ben s keynote talk. MDL-based graph summarization (Nisheeth s talk) 3
Solution: Graph Aggregation Two well-defined novel graph aggregation operations: SNAP & k-snap Summarization based on user-selected node attributes and relationships. Produce summaries with controllable resolutions. Provide drill-down and roll-up abilities to navigate multi-resolution summaries. Efficient algorithms Produce meaningful summaries for real applications. Efficient and scalable for very large graphs. 4
SNAP Operation Group nodes by user-selected node attributes & relationships Nodes in each group are homogenous w.r.t. attributes and relationships The grouping with the minimum # groups friends For example: All students in the blue group have the same gender and are in the same dept Every student in the blue group has: at least one friend in the green group at least one classmate in the purple group at least one friend in the orange group at least one classmate in the orange group 5
Evaluating SNAP Operation Top-Down Approach Step 1: group nodes just based on user-selected attributes. Iterative Step: while a group breaks homogeneity requirement for relationships split the group based on its relationships with other groups C B A A A A A C B Group Split A A A B C 6
Limitations of SNAP Operation Problems with the SNAP operation Homogeneity requirement for relationships Noise and uncertainty SNAP 1% % Reality 98% 3% strong relationship weak relationship Users have no control over the resolutions of summaries SNAP operation can result in a large number of small groups k-snap operation: Relax the homogeneity requirement for relationships Let users control the resolutions of summaries Provide drill-down and roll-up abilities to navigate summaries with different resolutions. 7
k-snap Operation Users control # groups in the resulting summary: k Maintain homogeneity requirement for attributes. Relax homogeneity requirement for relationships. Assess the quality of a summary 4 4 Δ= {δ gi,g j (g i ) + δ gi,g j (g j ) } g i,g j 1 8 (95,19):95% ΔMeasure 2 1 8 2 δ gi,g j (g i )= Δ= 5% 5% (weak) P gi,g j (g i ) if p i,j.5 g i P gi,g j (g i ) otherwise Δ+= 3+4 extra participants 95% > 5% (strong) Δ+= (1-95)+(2-19) missing participants 8
Evaluating k-snap Operation Goal: Find the summary of size k with the minimum Δ value (best quality) Proved to be NP-Complete! Infeasible to produce exact k-snap summaries. Alternative: heuristics Top-Down Approach Bottom-Up Approach 9
Top-Down Approach Similar to the SNAP evaluation algorithm (coarse fine) (Difference) At each iteration, it needs to decide: which group to split? how to split the group? Heuristics: Split a group into two subgroups at each iteration Find g i with the maximumδ gi,g j (g i ) (the most contribution to Δ) Split group g i based on whether the nodes in g i connect to g j. max Split g i Δ= {δ gi,g j (g i ) + δ gi,g j (g j ) } g i,g j g i g j g j g i 1
Bottom-Up Approach Compute the SNAP summary first (fine coarse) Iteratively merge two groups until the # groups is k Which two groups to merge? Heuristics: Same attribute values Similar neighbors Similar participation ratio g i p i,k MergeDist(g i, g j )= p i,k - p k,j k i,j g j p k,j g k Merge two groups with the minimum MergeDist. 11
Experimental Evaluation Implementation C++ on top of PostgreSQL Evaluation Platform 2.8GHz P4, 2GB RAM, 25GB SATA disk, FC2 PostgreSQL: version 8.1.3, 512 MB buffer pool Evaluation Measures: Effectiveness & Efficiency Verified by the SIGMOD repeatability committee. 12
Effectiveness: DB Coauthorship DBLP Database Coauthorship Graph (7,445 nodes, 19,971 edges) Node Attributes: name (string), numpub (int), prolific (LP, P, HP) LP:[1, 5], P:[6, 2], HP:[21, -] Relationship: coauthorship SNAP Attribute: prolific Relationship: coauthorship 3,569 groups, 11,293 group relationships 13
Effectiveness: DB Coauthorship SNAP Attribute: prolific k-snap Attribute: prolific Relationship: coauthorship K=4 9.74 36 1.66 1.55 K=7 K=6 K=5 1.26 1.23 2.23 14
Effectiveness: DB Coauthorship Impact of Double-Blind Reviewing on SIGMOD 5 4 3 2 1 5% 26% 1994-2 21-27.4 2 1 91% 75% 1994-2 21-27.3.2.1 73% 32% 1994-2 21-27 Average # Publications.3.2.1 89% 73%.2.1 99% 161% 1994-2 21-27 1994-2 21-27 VLDB.161.35 SIGMOD.125.217.1 26% 139% 1994-2 21-27.1-12% 17% 1994-2 21-27.4.3.2.1 165% 11% 1994-2 21-27
k-snap: Top-Down vs. Bottom-Up Dataset: DBLP DB Coauthorship Graph Quality Measure: Δ/k Top-down beats bottom-up for small k values Execution Time Top-down is much faster than bottom-up Delta/k 15 1 Execution Time (log scale) 5 8 1 1 1 1 1 16 1 32 64 128 256 k (log scale) 512 124 248 3569 1 1 1 1 1 k(log scale) Overall, top-down is the winner! Top-Down Bottom-Up Top-Down Bottom-Up 16
Efficiency: Synthetic Graphs Dataset: Synthetic Power-Law Graphs (by GTgraph) (avg degree:5) Execution Time (sec) 9 8 7 6 5 4 3 2 1 2, 4, 6, 8, 1,, Graph Size (#nodes) k=1 k=1 k=1 17
Conclusion Database-style aggregation for graph summarization Customized summaries Controllable resolutions drill-down and roll-up abilities Meaningful summaries for real applications Efficient and scalable for very large graphs Incorporated in Periscope/GQ graph querying system Combined with other graph operations to perform complex analysis on graphs (VLDB 8 Demo)
Questions? Suggestions? Thanks! 19