Efficient Aggregation for Graph Summarization

Similar documents
Data Mining for Business Intelligence: from Relational to Graph Representation

A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture

Epilog: Further Topics

Join Processing for Flash SSDs: Remembering Past Lessons

Iterative Graph Summarization based on Grouping

Probabilistic Graph Summarization

Efficient Subgraph Matching by Postponing Cartesian Products

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah

HYRISE In-Memory Storage Engine

Keyword Search on Form Results

Big Data Management and NoSQL Databases

From Think Like a Vertex to Think Like a Graph. Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John McPherson

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods

Graph Cube: On Warehousing and OLAP Multidimensional Networks

Compressing Relations and Indexes Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft 1 Database Talk Speaker : Jonathan Goldstein Microsoft Researc

Mining Summaries for Knowledge Graph Search

A Framework for Efficient Fingerprint Identification using a Minutiae Tree

MATE-EC2: A Middleware for Processing Data with Amazon Web Services

Enhanced Performance of Database by Automated Self-Tuned Systems

Comparative Study of Subspace Clustering Algorithms

Model-based Database Systems

DBMS Data Loading: An Analysis on Modern Hardware. Adam Dziedzic, Manos Karpathiotakis*, Ioannis Alagiannis, Raja Appuswamy, Anastasia Ailamaki

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader

Finding k-dominant Skylines in High Dimensional Space

Automatic Scaling Iterative Computations. Aug. 7 th, 2012

Search in Social Networks with Access Control

Course Outline. Writing Reports with Report Builder and SSRS Level 1 Course 55123: 2 days Instructor Led. About this course

Efficient Top-k Shortest-Path Distance Queries on Large Networks by Pruned Landmark Labeling with Application to Network Structure Prediction

CSE 701: LARGE-SCALE GRAPH MINING. A. Erdem Sariyuce

GraphQ: Graph Query Processing with Abstraction Refinement -- Scalable and Programmable Analytics over Very Large Graphs on a Single PC

Guoping Wang and Chee-Yong Chan Department of Computer Science, School of Computing National University of Singapore VLDB 14.

Mobility Data Management & Exploration

Improving the Performance of OLAP Queries Using Families of Statistics Trees

Incognito: Efficient Full Domain K Anonymity

Dynamic and Historical Shortest-Path Distance Queries on Large Evolving Networks by Pruned Landmark Labeling

Practical Methods for Constructing Suffix Trees. Advisor: Jignesh M. Patel Yuanyuan Tian

Design Tradeoffs for Data Deduplication Performance in Backup Workloads

Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering

I/O Efficient Algorithms for Exact Distance Queries on Disk- Resident Dynamic Graphs

CS 664 Segmentation. Daniel Huttenlocher

Part A: MapReduce. Introduction Model Implementation issues

Data Mining for Knowledge Management. Association Rules

Scaling Without Sharding. Baron Schwartz Percona Inc Surge 2010

Novel Hybrid k-d-apriori Algorithm for Web Usage Mining

gsketch: On Query Estimation in Graph Streams

Speeding Up Data Science: From a Data Management Perspective

Privacy Preserving Data Publishing: From k-anonymity to Differential Privacy. Xiaokui Xiao Nanyang Technological University

SSD-based Information Retrieval Systems

Column Stores vs. Row Stores How Different Are They Really?

THE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan

Construction of Minimum-Weight Spanners Mikkel Sigurd Martin Zachariasen

ECE 669 Parallel Computer Architecture

PROJECT PROPOSALS: COMMUNITY DETECTION AND ENTITY RESOLUTION. Donatella Firmani

Succinct Data Structures: Theory and Practice

TrajStore: an Adaptive Storage System for Very Large Trajectory Data Sets

Mining Frequent Patterns without Candidate Generation

Jignesh M. Patel. Blog:

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

Foster B-Trees. Lucas Lersch. M. Sc. Caetano Sauer Advisor

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

SSD-based Information Retrieval Systems

Oh Pott, Oh Pott! or how to detect community structure in complex networks

Managing and Mining Billion Node Graphs. Haixun Wang Microsoft Research Asia

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu University of California, Merced

Protecting Web Servers from Web Robot Traffic

Tree-Pattern Queries on a Lightweight XML Processor

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

Data cube and OLAP. Selecting Views to Materialize. Aggregation view lattice. Selecting views to materialize. Limitations of static approach.

Graph Data Management

Evolution of Database Systems

SociaLite: A Python-Integrated Query Language for

IMAGE MATCHING - ALOK TALEKAR - SAIRAM SUNDARESAN 11/23/2010 1

Parallelizing String Similarity Join Algorithms

Graph-Based Synopses for Relational Data. Alkis Polyzotis (UC Santa Cruz)

Sandor Heman, Niels Nes, Peter Boncz. Dynamic Bandwidth Sharing. Cooperative Scans: Marcin Zukowski. CWI, Amsterdam VLDB 2007.

Towards Declarative and Efficient Querying on Protein Structures

Network visualization techniques and evaluation

Capturing Topology in Graph Pattern Matching

Normalized cuts and image segmentation

Mind the Gap: Large-Scale Frequent Sequence Mining

Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets

Chapter 13 Business Intelligence and Data Warehouses The Need for Data Analysis Business Intelligence. Objectives

Computations with Bounded Errors and Response Times on Very Large Data

Scaling Distributed Machine Learning with the Parameter Server

Searching non-text information objects

MySQL Data Mining: Extending MySQL to support data mining primitives (demo)

On Multiple Query Optimization in Data Mining

Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System

Roadmap DB Sys. Design & Impl. Reference. Detailed Roadmap. Motivation. Outline of LRU-K. Buffering - LRU-K

Efficiency vs. Effectiveness in Terabyte-Scale IR

Data Partitioning and MapReduce

On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage

Scalable Influence Maximization in Social Networks under the Linear Threshold Model

Big Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety

summarization Ivan Ivanov, Peter Vajda, Jong-Seok Lee, Touradj Ebrahimi

TRELLIS+: AN EFFECTIVE APPROACH FOR INDEXING GENOME-SCALE SEQUENCES USING SUFFIX TREES

A Parallel Algorithm for Exact Structure Learning of Bayesian Networks

LazyBase: Trading freshness and performance in a scalable database

Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges

Transcription:

Efficient Aggregation for Graph Summarization Yuanyuan Tian (University of Michigan) Richard A. Hankins (Nokia Research Center) Jignesh M. Patel (University of Michigan)

Motivation Graphs are everywhere Social networks, biological networks Graph datasets growing rapidly in size. 5,, 4,, 3,, Number of Facebook Users 2,, 1,, DB Coauthorship Jun-4 Sep-4 Dec-4 Mar-5 Jun-5 Sep-5 Dec-5 Mar-6 Jun-6 Sep-6 Dec-6 Mar-7 Jun-7 Sep-7 Dec-7 7,445 nodes, 19,971 edges Impossible to understand by mere visual inspection. Need: Graph Summarization

Existing Methods Statistical methods Limited information, hard to interpret & manipulate. Frequent subgraph mining methods Produce a large number of results. Graph partitioning methods Largely ignore node attributes. Graph compression Compact storage. Graph visualization Ben s keynote talk. MDL-based graph summarization (Nisheeth s talk) 3

Solution: Graph Aggregation Two well-defined novel graph aggregation operations: SNAP & k-snap Summarization based on user-selected node attributes and relationships. Produce summaries with controllable resolutions. Provide drill-down and roll-up abilities to navigate multi-resolution summaries. Efficient algorithms Produce meaningful summaries for real applications. Efficient and scalable for very large graphs. 4

SNAP Operation Group nodes by user-selected node attributes & relationships Nodes in each group are homogenous w.r.t. attributes and relationships The grouping with the minimum # groups friends For example: All students in the blue group have the same gender and are in the same dept Every student in the blue group has: at least one friend in the green group at least one classmate in the purple group at least one friend in the orange group at least one classmate in the orange group 5

Evaluating SNAP Operation Top-Down Approach Step 1: group nodes just based on user-selected attributes. Iterative Step: while a group breaks homogeneity requirement for relationships split the group based on its relationships with other groups C B A A A A A C B Group Split A A A B C 6

Limitations of SNAP Operation Problems with the SNAP operation Homogeneity requirement for relationships Noise and uncertainty SNAP 1% % Reality 98% 3% strong relationship weak relationship Users have no control over the resolutions of summaries SNAP operation can result in a large number of small groups k-snap operation: Relax the homogeneity requirement for relationships Let users control the resolutions of summaries Provide drill-down and roll-up abilities to navigate summaries with different resolutions. 7

k-snap Operation Users control # groups in the resulting summary: k Maintain homogeneity requirement for attributes. Relax homogeneity requirement for relationships. Assess the quality of a summary 4 4 Δ= {δ gi,g j (g i ) + δ gi,g j (g j ) } g i,g j 1 8 (95,19):95% ΔMeasure 2 1 8 2 δ gi,g j (g i )= Δ= 5% 5% (weak) P gi,g j (g i ) if p i,j.5 g i P gi,g j (g i ) otherwise Δ+= 3+4 extra participants 95% > 5% (strong) Δ+= (1-95)+(2-19) missing participants 8

Evaluating k-snap Operation Goal: Find the summary of size k with the minimum Δ value (best quality) Proved to be NP-Complete! Infeasible to produce exact k-snap summaries. Alternative: heuristics Top-Down Approach Bottom-Up Approach 9

Top-Down Approach Similar to the SNAP evaluation algorithm (coarse fine) (Difference) At each iteration, it needs to decide: which group to split? how to split the group? Heuristics: Split a group into two subgroups at each iteration Find g i with the maximumδ gi,g j (g i ) (the most contribution to Δ) Split group g i based on whether the nodes in g i connect to g j. max Split g i Δ= {δ gi,g j (g i ) + δ gi,g j (g j ) } g i,g j g i g j g j g i 1

Bottom-Up Approach Compute the SNAP summary first (fine coarse) Iteratively merge two groups until the # groups is k Which two groups to merge? Heuristics: Same attribute values Similar neighbors Similar participation ratio g i p i,k MergeDist(g i, g j )= p i,k - p k,j k i,j g j p k,j g k Merge two groups with the minimum MergeDist. 11

Experimental Evaluation Implementation C++ on top of PostgreSQL Evaluation Platform 2.8GHz P4, 2GB RAM, 25GB SATA disk, FC2 PostgreSQL: version 8.1.3, 512 MB buffer pool Evaluation Measures: Effectiveness & Efficiency Verified by the SIGMOD repeatability committee. 12

Effectiveness: DB Coauthorship DBLP Database Coauthorship Graph (7,445 nodes, 19,971 edges) Node Attributes: name (string), numpub (int), prolific (LP, P, HP) LP:[1, 5], P:[6, 2], HP:[21, -] Relationship: coauthorship SNAP Attribute: prolific Relationship: coauthorship 3,569 groups, 11,293 group relationships 13

Effectiveness: DB Coauthorship SNAP Attribute: prolific k-snap Attribute: prolific Relationship: coauthorship K=4 9.74 36 1.66 1.55 K=7 K=6 K=5 1.26 1.23 2.23 14

Effectiveness: DB Coauthorship Impact of Double-Blind Reviewing on SIGMOD 5 4 3 2 1 5% 26% 1994-2 21-27.4 2 1 91% 75% 1994-2 21-27.3.2.1 73% 32% 1994-2 21-27 Average # Publications.3.2.1 89% 73%.2.1 99% 161% 1994-2 21-27 1994-2 21-27 VLDB.161.35 SIGMOD.125.217.1 26% 139% 1994-2 21-27.1-12% 17% 1994-2 21-27.4.3.2.1 165% 11% 1994-2 21-27

k-snap: Top-Down vs. Bottom-Up Dataset: DBLP DB Coauthorship Graph Quality Measure: Δ/k Top-down beats bottom-up for small k values Execution Time Top-down is much faster than bottom-up Delta/k 15 1 Execution Time (log scale) 5 8 1 1 1 1 1 16 1 32 64 128 256 k (log scale) 512 124 248 3569 1 1 1 1 1 k(log scale) Overall, top-down is the winner! Top-Down Bottom-Up Top-Down Bottom-Up 16

Efficiency: Synthetic Graphs Dataset: Synthetic Power-Law Graphs (by GTgraph) (avg degree:5) Execution Time (sec) 9 8 7 6 5 4 3 2 1 2, 4, 6, 8, 1,, Graph Size (#nodes) k=1 k=1 k=1 17

Conclusion Database-style aggregation for graph summarization Customized summaries Controllable resolutions drill-down and roll-up abilities Meaningful summaries for real applications Efficient and scalable for very large graphs Incorporated in Periscope/GQ graph querying system Combined with other graph operations to perform complex analysis on graphs (VLDB 8 Demo)

Questions? Suggestions? Thanks! 19