Link Prediction in Graph Streams

Similar documents
Edge Classification in Networks

gsketch: On Query Estimation in Graph Streams

Link Prediction for Social Network

On Biased Reservoir Sampling in the Presence of Stream Evolution

On Dimensionality Reduction of Massive Graphs for Indexing and Retrieval

Predictive Indexing for Fast Search

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks

A Framework for Clustering Massive Text and Categorical Data Streams

Online Social Networks and Media

The complement of PATH is in NL

TELCOM2125: Network Science and Analysis

MIDTERM EXAMINATION Networked Life (NETS 112) November 21, 2013 Prof. Michael Kearns

Collective Entity Resolution in Relational Data

Compressing Social Networks

Association Pattern Mining. Lijun Zhang

Spectral Clustering and Community Detection in Labeled Graphs

On Compressing Social Networks. Ravi Kumar. Yahoo! Research, Sunnyvale, CA. Jun 30, 2009 KDD 1

Analyzing Graph Structure via Linear Measurements.

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013

A Framework for Clustering Massive-Domain Data Streams

Similarity Ranking in Large- Scale Bipartite Graphs

Analysis of Biological Networks. 1. Clustering 2. Random Walks 3. Finding paths

SimRank Computation on Uncertain Graphs

Kernel Methods & Support Vector Machines

Module 1: Introduction to Finite Difference Method and Fundamentals of CFD Lecture 6:

Exam Principles of Distributed Computing

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Introduction Types of Social Network Analysis Social Networks in the Online Age Data Mining for Social Network Analysis Applications Conclusion

Mining Web Data. Lijun Zhang

On Dense Pattern Mining in Graph Streams

Predicting Social Links for New Users across Aligned Heterogeneous Social Networks

Predicting User Ratings Using Status Models on Amazon.com

Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs

A SURVEY OF SYNOPSIS CONSTRUCTION IN DATA STREAMS

Basic relations between pixels (Chapter 2)

Kernel-based online machine learning and support vector reduction

Value Added Association Rules

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Graph Algorithms. Definition

This Talk. Map nodes to low-dimensional embeddings. 2) Graph neural networks. Deep learning architectures for graphstructured

31.6 Powers of an element

AspEm: Embedding Learning by Aspects in Heterogeneous Information Networks

Random projection for non-gaussian mixture models

Chapter 3 Trees. Theorem A graph T is a tree if, and only if, every two distinct vertices of T are joined by a unique path.

Link prediction in multiplex bibliographical networks

Analytic Performance Models for Bounded Queueing Systems

BigML homework 6: Efficient Approximate PageRank

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014

CSE547: Machine Learning for Big Data Spring Problem Set 1. Please read the homework submission policies.

Problem 1: Complexity of Update Rules for Logistic Regression

EDGE OFFSET IN DRAWINGS OF LAYERED GRAPHS WITH EVENLY-SPACED NODES ON EACH LAYER

Leveraging Set Relations in Exact Set Similarity Join

Unit 4: Formal Verification

Lesson 4. Random graphs. Sergio Barbarossa. UPC - Barcelona - July 2008

Jaccard Coefficients as a Potential Graph Benchmark

LECTURES 3 and 4: Flows and Matchings

Leveraging Transitive Relations for Crowdsourced Joins*

Parallel Peeling Algorithms. Justin Thaler, Yahoo Labs Joint Work with: Michael Mitzenmacher, Harvard University Jiayang Jiang

Semi-Supervised Clustering with Partial Background Information

Shingling Minhashing Locality-Sensitive Hashing. Jeffrey D. Ullman Stanford University

Extracting Information from Complex Networks

Local Algorithms for Sparse Spanning Graphs

PageRank. CS16: Introduction to Data Structures & Algorithms Spring 2018

arxiv: v3 [cs.dm] 12 Jun 2014

4. Image Retrieval using Transformed Image Content

Constructing a G(N, p) Network

A Framework for Clustering Massive Graph Streams

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18

Link Sign Prediction and Ranking in Signed Directed Social Networks

CSE 373 Autumn 2012: Midterm #2 (closed book, closed notes, NO calculators allowed)

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation.

Coloring Signed Graphs

Exact Algorithms Lecture 7: FPT Hardness and the ETH

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing

Cellular Automata. Nicholas Geis. January 22, 2015

Disclaimer. Lect 2: empirical analyses of graphs

1 Bipartite maximum matching

Topic mash II: assortativity, resilience, link prediction CS224W

Stanford University CS359G: Graph Partitioning and Expanders Handout 18 Luca Trevisan March 3, 2011

Sparse Hypercube 3-Spanners

Lecture 11: Clustering and the Spectral Partitioning Algorithm A note on randomized algorithm, Unbiased estimates

Data-Intensive Distributed Computing

A Improving Classification Quality in Uncertain Graphs

Scalable Video Transport over Wireless IP Networks. Dr. Dapeng Wu University of Florida Department of Electrical and Computer Engineering

Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/18/14

The Encoding Complexity of Network Coding

Author(s): Rahul Sami, 2009

Graphs. Carlos Moreno uwaterloo.ca EIT

Analysis of Link Reversal Routing Algorithms for Mobile Ad Hoc Networks

Scaling Link-Based Similarity Search

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs

On Clustering Graph Streams

Random Walks and Cover Times. Project Report. Aravind Ranganathan. ECES 728 Internet Studies and Web Algorithms

Transcription:

Peixiang Zhao, Charu C. Aggarwal, and Gewen He Florida State University IBM T J Watson Research Center Link Prediction in Graph Streams ICDE Conference, 2016

Graph Streams. Graph streams arise in a wide variety of applications: activity-centric social and information networks, communication networks, chat networks. Graph streams are ubiquitous in settings in which activity is overlaid on network structure.

Challenges. It is difficult to store the entire graph on disk because of the high-volume stream, yet graph applications such as link prediction require a structural understanding of the graph. The key is to design probabilistic summaries that can work in the link-prediction setting.

The Link Prediction Problem. Given a network G = (N, A), determine the pairs of nodes that are most likely to receive a link between them. Typical classes of algorithms: neighborhood-based, matrix factorization, supervised.

Contributions of this Work. A link prediction algorithm for graph streams: we design summarization schemes that use small-space data structures for prediction, able to achieve results close to their exact counterparts, and we design the real-time analogs of the Jaccard coefficient, Adamic-Adar, and common-neighbor measures.

Key Techniques Used. We design MinHash-based graph sketches to estimate the Jaccard coefficient, and vertex-biased reservoir sampling based graph sketches to estimate common neighbors and Adamic-Adar. We provide theoretical guarantees with respect to exact estimation, and the methods show robust performance relative to their exact counterparts.

Streaming Setting. We consider the streaming link prediction problem in a graph stream that receives a sequence (e_0, e_1, ..., e_t, ...) of edges, each of the form e_i = (u, v) at time point i, where u, v ∈ V are the incident vertices of the edge e_i. We assume the underlying graph is undirected, so the ordering of u and v in an edge is insignificant. The proposed graph sketches and the corresponding estimation algorithms can be generalized to directed graphs with minor revisions.

Formalization. At any given time t, the edges seen thus far from the graph stream imply a conceptual graph G(t) = (V(t), E(t)), where V(t) is the set of vertices and E(t) is the set of distinct edges up to time t. We use τ(u, t) to represent the set of adjacent vertices of the vertex u in the graph G(t). In other words, τ(u, t) contains the distinct vertices adjacent to u in the graph stream up to time t, and the degree of u is denoted d(u, t) = |τ(u, t)|. Given a graph stream G(t) at time t, the streaming link prediction problem is to predict whether there is or will be an edge e = (u, v) for any pair of vertices u, v ∈ V(t) with e ∉ E(t).

Target Measures. In a graph stream G(t) and for any u, v ∈ V(t), the target measures for streaming link prediction are defined as follows: 1. Preferential attachment: |τ(u,t)| · |τ(v,t)|; 2. Common neighbor: |τ(u,t) ∩ τ(v,t)|; 3. Jaccard coefficient: |τ(u,t) ∩ τ(v,t)| / |τ(u,t) ∪ τ(v,t)|; 4. Adamic-Adar: Σ_{w ∈ τ(u,t) ∩ τ(v,t)} 1/log(|τ(w,t)|).
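For reference, in a static graph all four measures are simple set operations over the full neighbor sets τ(u, t). A minimal sketch with a toy graph (the graph and values are illustrative, not from the paper):

```python
import math

def target_measures(adj, u, v):
    """Compute the four link-prediction measures from full neighbor sets."""
    Nu, Nv = adj[u], adj[v]
    common = Nu & Nv
    union = Nu | Nv
    return {
        "preferential_attachment": len(Nu) * len(Nv),
        "common_neighbor": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "adamic_adar": sum(1.0 / math.log(len(adj[w])) for w in common),
    }

# Toy undirected graph as adjacency sets.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}
m = target_measures(adj, "b", "d")  # b and d share neighbors {a, c}
```

The streaming difficulty discussed on the next slides is precisely that none of these set operations can be evaluated exactly without the full neighbor sets.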

Observations. These measures seem relatively trivial to compute in the static setting. However, in the streaming setting we do not have a global view of the graph at any given time, which makes simple computations surprisingly difficult. As an example, consider the simplest measure: preferential attachment.

Preferential Attachment. Preferential attachment requires an accurate estimate of the number of distinct neighbors of each vertex, i.e., |τ(u, t)| for u ∈ V(t). This is not as easy as it sounds, because the distinct edges cannot be explicitly maintained and updated for exact counting in a graph stream. Streaming methods for distinct-element counting may be employed. The other neighborhood measures are much harder, and they are the focus of the paper.
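One standard small-space distinct-counting method (a generic choice, not necessarily the one used in the paper) is the K-minimum-values sketch: keep the k smallest hash values of the distinct neighbors seen so far and estimate the cardinality as (k-1)/h_(k), where h_(k) is the k-th smallest hash value. A sketch:

```python
import hashlib

def h(x):
    """Hash an id to a float in (0, 1]."""
    d = hashlib.sha1(str(x).encode()).digest()
    return (int.from_bytes(d[:8], "big") + 1) / 2**64

class KMV:
    """K-minimum-values sketch for distinct counting."""
    def __init__(self, k):
        self.k, self.vals = k, []
    def add(self, x):
        v = h(x)
        if v not in self.vals:
            self.vals.append(v)
            self.vals.sort()
            del self.vals[self.k:]        # keep only the k smallest values
    def estimate(self):
        if len(self.vals) < self.k:
            return float(len(self.vals))  # exact while under-full
        return (self.k - 1) / self.vals[-1]

sketch = KMV(k=64)
for i in range(1000):
    sketch.add(i)   # 1000 distinct neighbors
    sketch.add(i)   # duplicate arrivals do not change the sketch
est = sketch.estimate()   # close to 1000, using only 64 stored values
```

The relative error of this estimator is roughly 1/sqrt(k), so the budget k directly trades space against accuracy.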

Jaccard Coefficient. A natural approach is the min-hash index, which has been used earlier in various applications for computing the Jaccard coefficient (Broder et al.). Given a graph stream, we maintain a MinHash-based graph sketch for each vertex u with two key values, the minimum adjacent hash value H(u) and the minimum adjacent hash index I(u): H(u) = min { h(v) : (u, v) ∈ E(t) } (1); I(u) = argmin_v { h(v) : (u, v) ∈ E(t) } (2). The Jaccard coefficient between a pair of vertices is equal to the probability that their min-hash indices are the same.

Basic Intuition. Consider two columns in a binary matrix: if you sort the rows randomly, what is the probability that, in the first row in which at least one of the columns shows a 1, both columns show a 1? This probability is exactly the Jaccard coefficient, and it is captured by the min-hash index: each hash function simulates a random sort, and multiple hash functions are used for robustness. A Chernoff bound can be used to provide a guarantee.
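This agreement probability can be checked empirically: with k independent hash functions, the fraction of functions on which I(u) = I(v) converges to the Jaccard coefficient. A minimal sketch (the SHA-1-seeded hash family is my own choice, not the paper's):

```python
import hashlib

def h(seed, x):
    """One member of a family of hash functions, indexed by seed."""
    d = hashlib.sha1(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(d[:8], "big")

class MinHashSketch:
    """Per-vertex sketch: for each of k hash functions, keep the neighbor
    with the minimum hash value (H(u)) and its identifier (I(u))."""
    def __init__(self, k):
        self.k = k
        self.min_val = [None] * k   # H(u) per hash function
        self.min_idx = [None] * k   # I(u) per hash function
    def add_neighbor(self, v):
        for s in range(self.k):
            hv = h(s, v)
            if self.min_val[s] is None or hv < self.min_val[s]:
                self.min_val[s], self.min_idx[s] = hv, v

def estimate_jaccard(su, sv):
    agree = sum(su.min_idx[s] == sv.min_idx[s] for s in range(su.k))
    return agree / su.k

# Two vertices whose neighbor sets overlap in 50 of 150 distinct vertices.
su, sv = MinHashSketch(k=200), MinHashSketch(k=200)
for w in range(100):
    su.add_neighbor(w)          # neighbors 0..99
for w in range(50, 150):
    sv.add_neighbor(w)          # neighbors 50..149
est = estimate_jaccard(su, sv)  # true Jaccard = 50/150 = 1/3
```

The sketch is updated incrementally as each edge arrives, so it never needs the full neighbor sets.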

Common Neighbor Estimation. To estimate the common neighbors of two vertices in a graph stream, a natural question arises: is a sample of edges from the graph stream good enough to estimate this target measure accurately? Unfortunately, unbiased sampling provides a negative answer, mainly due to the power-law degree distribution of real-world graphs.

Example. Consider a graph in which the top 1% of high-degree vertices (degrees greater than 10) contain 99% of the edges. Down-sampling at a rate of 10% will capture the neighborhood information of these top 1% of high-degree vertices robustly, but may not capture even a single edge for the remaining 99% of low-degree vertices. Result: poor predictions!

Solution: Vertex-biased Sampling. To address this issue, we design vertex-biased sampling based graph sketches, where each vertex u ∈ G(t) is associated with a reservoir S(u) of budget L that dynamically samples L incident edges of u. Note that if an edge (u, v) is sampled, the vertex v, not the edge itself, is maintained and updated in u's reservoir S(u). This sampling approach is biased, because the number of sampled edges, L, is fixed for all vertices in the graph stream; the sampling rate is therefore disproportionately higher for low-degree vertices than for high-degree ones.

Issues. Sampling L incident edges of each vertex can be implemented by the traditional reservoir sampling method. But another problem arises for high-degree vertices: for two vertices with degrees larger than L, the number of common neighbors is hard to estimate accurately from their independent samples. For example, consider the case where the budget is L = 10, and the two vertices u and v each have degree 1,000, with all 1,000 incident vertices being common neighbors of u and v. If 10 adjacent neighbors of u and of v are sampled independently, the expected number of common neighbors observed in both samples is 10 × 10/1000 = 0.1, which is clearly not an accurate basis for estimation.

Solutions. To this end, we adopt this constant-budget sampling approach with an important caveat: the random samples in the different vertices' reservoirs are forced to be dependent on each other. To model this dependency, we consider an implicit sorting order of vertices, imposing a priority order on the vertex set of the graph stream. Such a priority order is enforced with a hash function G : V → (0, 1), where the argument u is a vertex identifier and the output is a real number in the range (0, 1), with lower values indicating higher priority.

Specifically, the vertices retained in the reservoir S(u) are those having the L highest priorities among all incident neighbors of u.
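This priority-based reservoir can be sketched as follows. The hash G here is a stand-in built from SHA-1 (the slides do not specify one), and for simplicity the sketch assumes each distinct edge arrives once; because every reservoir uses the same G, whether a neighbor w survives in S(u) depends only on G(w), identically across all reservoirs:

```python
import hashlib
import heapq

def G(v):
    """Priority hash: vertex id -> (0, 1]; lower value = higher priority."""
    d = hashlib.sha1(str(v).encode()).digest()
    return (int.from_bytes(d[:8], "big") + 1) / 2**64

class VertexReservoir:
    """Keep the L incident neighbors with the lowest priority values G(.)."""
    def __init__(self, L):
        self.L = L
        self.heap = []          # max-heap on G, via negated keys
        self.members = set()
        self.degree = 0         # distinct neighbors seen (each arrives once)
    def add_neighbor(self, v):
        if v in self.members:
            return
        self.degree += 1
        heapq.heappush(self.heap, (-G(v), v))
        self.members.add(v)
        if len(self.heap) > self.L:
            _, evicted = heapq.heappop(self.heap)  # drop lowest priority
            self.members.remove(evicted)
    def eta(self):
        """Sampling threshold eta(u, t) = min{1, L/d(u, t)}."""
        return min(1.0, self.L / self.degree) if self.degree else 1.0

res = VertexReservoir(L=10)
for v in range(100):
    res.add_neighbor(v)
# res.members is exactly the 10 neighbors with the smallest G values.
```

Each update is O(log L), and the retained set is a deterministic function of the neighbor identifiers, which is what makes the reservoirs of different vertices coordinated.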

Sampling Rate. In order to dynamically maintain and update S(u) as the graph stream evolves, it is important to note that, if the degree of the vertex u satisfies d(u, t) > L, only a fraction η(u, t) = L/d(u, t) of all the incident vertices of u can be retained in the sketch S(u). To account for this, we define a threshold on the priority value G(·) that an incident neighbor of u must satisfy to survive in the reservoir S(u): η(u, t) = min{1, L/d(u, t)} (3)

Estimating the Number of Common Neighbors. Let η(u, t) and η(v, t) be the fractions (thresholds) of incident neighbors of u and v, respectively, that are sampled. The number of common neighbors C_uv of u and v is estimated as C_uv = |S(u) ∩ S(v)| / min{η(u, t), η(v, t)} (4), since a common neighbor appears in both reservoirs only when its priority passes the stricter of the two thresholds. The paper discusses the trick to dynamically maintain η(u, t) and η(v, t).
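The estimator can be sketched end-to-end in a few lines. The priority hash and reservoir construction below are my own stand-ins (the slides do not give the paper's exact code); the key point is that, with one shared hash G, a common neighbor lands in both reservoirs exactly when its priority passes the stricter (smaller) threshold, so the intersection size is scaled up by min{η(u,t), η(v,t)}:

```python
import hashlib

def G(v):
    """Shared priority hash: vertex id -> (0, 1]."""
    d = hashlib.sha1(str(v).encode()).digest()
    return (int.from_bytes(d[:8], "big") + 1) / 2**64

def reservoir(neighbors, L):
    """Keep the L neighbors with the smallest priority values G(.)."""
    kept = set(sorted(neighbors, key=G)[:L])
    eta = min(1.0, L / len(neighbors))
    return kept, eta

def estimate_common_neighbors(Su, eta_u, Sv, eta_v):
    return len(Su & Sv) / min(eta_u, eta_v)

# u and v each have 1000 neighbors, 500 of them shared.
common = set(range(500))
Nu = common | set(range(1000, 1500))
Nv = common | set(range(2000, 2500))
Su, eta_u = reservoir(Nu, L=100)
Sv, eta_v = reservoir(Nv, L=100)
est = estimate_common_neighbors(Su, eta_u, Sv, eta_v)  # true value: 500
```

With independent sampling at the same budget, the expected intersection would be only 500 × 0.1 × 0.1 = 5 elements; coordination through the shared hash is what makes the overlap informative.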

Adamic-Adar. An important observation about Adamic-Adar is that a common neighbor w of vertices u and v with a higher vertex degree is weighted less, because of its proclivity to be a noisy vertex, as accounted for in the term 1/log(|τ(w, t)|). Such a weighting strategy makes it very challenging to estimate Adamic-Adar in graph streams: very high-degree vertices contribute noise.

Truncated Adamic-Adar. We consider an approximate variation of Adamic-Adar that truncates the insignificant high-degree common neighbors of u and v for streaming link prediction, because they contribute little to the computation of Adamic-Adar. It turns out that such an approximation still preserves the accuracy of Adamic-Adar, yet can be computed exactly based on the vertex-biased sampling based graph sketches. The truncated Adamic-Adar, denoted AA(u, v), between two vertices u and v is defined in the same way as Adamic-Adar, except that the components contributed by common neighbors with degrees greater than d_max are eliminated: AA(u, v) = Σ_{w ∈ τ(u,t) ∩ τ(v,t) : |τ(w,t)| ≤ d_max} 1/log(|τ(w, t)|) (5)

Approach. The core idea is that each component in the equation can be directly computed if both vertices u and v are present in the graph sketch S(w) of their common neighbor w, whose degree is no larger than d_max. At the beginning, we initialize the capacity of the reservoirs as L = d_max for all vertices, because vertices whose degrees are larger than d_max will not be considered in the computation of truncated Adamic-Adar.

Truncated Adamic-Adar. We then examine all the graph sketches one by one. For each vertex w, both vertices u and v need to be present in S(w) (Line 5). It is further checked whether w has a degree of at most d_max by examining whether the value of η(w, t) equals 1, indicating that the graph sketch S(w) is not yet full. If so, 1/log(|τ(w, t)|) is added to AA(u, v). Note that in this case the degree |τ(w, t)| is equal to the current reservoir size |S(w)|.
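Under the setup just described (reservoirs of capacity L = d_max, with η(w, t) = 1 exactly when S(w) is not yet full and therefore holds all of w's neighbors), the scan can be sketched as follows; the data layout is a hypothetical stand-in for the paper's structures:

```python
import math

def truncated_adamic_adar(sketches, u, v):
    """sketches: vertex w -> (S(w), eta(w, t)), reservoirs of capacity d_max.
    eta(w, t) == 1.0 means S(w) is not full, i.e. deg(w) <= d_max and
    S(w) is exactly w's full neighbor set, so |S(w)| = |tau(w, t)|."""
    score = 0.0
    for w, (S, eta) in sketches.items():
        if eta == 1.0 and u in S and v in S:
            score += 1.0 / math.log(len(S))  # len(S) >= 2 since u, v in S
    return score

# Toy sketches with d_max = 3: vertex -> (sampled neighbor set, eta).
sketches = {
    "w1": ({"u", "v"}, 1.0),        # degree 2 <= d_max: contributes
    "w2": ({"u", "v", "x"}, 1.0),   # degree 3 <= d_max: contributes
    "w3": ({"u", "v", "x"}, 0.3),   # degree > d_max: truncated away
}
score = truncated_adamic_adar(sketches, "u", "v")
# accumulates 1/log(2) + 1/log(3)
```

Because only under-full reservoirs contribute, each term is computed exactly rather than estimated, which is why truncation trades a bounded bias for zero sampling error.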

Handling Computational Bottlenecks. The main computational bottleneck is that the graph sketches of all vertices need to be scanned once for each incoming new edge (u, v) in the graph stream. To alleviate this problem, we adopt inverted indexes to facilitate the computation of truncated Adamic-Adar. Specifically, for each vertex u in the graph stream, an auxiliary inverted index structure L(u) is built to maintain the identifiers of the vertices v whose graph sketch S(v) contains u, i.e., v ∈ L(u) if and only if u ∈ S(v). The inverted indexes are dynamically maintained, and they double the memory requirement.
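A minimal sketch of such an index (the class and method names are hypothetical): L(u) is updated whenever u is inserted into or evicted from some reservoir S(v), so that for a new edge (u, v) only the sketches in L(u) ∩ L(v) need to be examined instead of all of them.

```python
from collections import defaultdict

class SketchWithIndex:
    """Maintain S(v) per vertex plus inverted lists L(u) = {v : u in S(v)}."""
    def __init__(self):
        self.S = defaultdict(set)   # vertex -> sampled neighbors
        self.L = defaultdict(set)   # vertex -> sketches that contain it

    def insert(self, v, u):
        """Record that u was sampled into v's reservoir S(v)."""
        self.S[v].add(u)
        self.L[u].add(v)

    def evict(self, v, u):
        """Record that u was evicted from S(v)."""
        self.S[v].discard(u)
        self.L[u].discard(v)

    def candidates(self, u, v):
        """Vertices w with both u and v in S(w): the only sketches that can
        contribute to truncated Adamic-Adar for a new edge (u, v)."""
        return self.L[u] & self.L[v]

idx = SketchWithIndex()
idx.insert("w1", "u"); idx.insert("w1", "v")
idx.insert("w2", "u")
idx.insert("w3", "u"); idx.insert("w3", "v"); idx.evict("w3", "v")
cands = idx.candidates("u", "v")   # only w1 still contains both u and v
```

Keeping the forward sketches and the inverted lists in lock-step is what doubles the memory footprint, as the slide notes.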

Experimental Results. We chose three real-world, publicly available graph datasets for our experimental studies: DBLP, the Amazon product co-purchasing network, and the Wikipedia citation network. Their edges carry timestamps indicating when they were first created in the graphs, and thus can be ordered, formulated, and processed as graph streams.

Prediction Accuracy. Accuracy is computed with respect to a random predictor: the reported value is the ratio of a method's accuracy to that of a random predictor, so values greater than 1 are good.

DBLP Results.
Method:    Ext-Ja   App-Ja       Ext-CN     App-CN    App-CN      Ext-Adar   App-Adar
Accuracy:  120.57   104.25       112.66     108.91    108.91      93.47      92.57
Method:    Katz     PrefAttach   PropFlow   R-PRank   Short-Path  SimRank
Accuracy:  100.48   114.97       96.76      59.82     107.49      74.85

Amazon Co-Purchasing Network.
Method:    Ext-Ja   App-Ja       Ext-CN     App-CN    Ext-Adar    App-Adar
Accuracy:  116.15   106.74       121.07     109.70    116.57      116.15
Method:    Katz     PrefAttach   PropFlow   R-PRank   Short-Path  SimRank
Accuracy:  88.38    116.23       98.75      71.48     147.70      110.95

Progression with Stream Size (DBLP). [Figure: link prediction accuracy vs. graph stream size (0.4M, 0.6M, 0.8M, 1M, 1.2M, All) for (a) Jaccard (App-Ja vs. Ext-Ja), (b) common neighbor (App-CN vs. Ext-CN), and (c) Adamic-Adar (App-Adar vs. Ext-Adar).]

Performance with Respect to Algorithm Parameters. [Figure: link prediction accuracy on DBLP, Amazon, and Wikipedia for (a) Jaccard vs. number of min-hash functions (10, 20, 35, 50, 100), (b) common neighbor vs. reservoir budget (15, 30, 60, 80, 100), and (c) Adamic-Adar vs. reservoir budget (15, 30, 60, 80, 100).]

Space with Respect to Algorithm Parameters. [Figure: space cost (MB) on DBLP, Amazon, and Wikipedia for (a) Jaccard vs. number of min-hash functions (10, 20, 35, 50, 100), (b) common neighbor vs. reservoir budget (15, 30, 60, 80, 100), and (c) Adamic-Adar vs. reservoir budget (15, 30, 60, 80, 100).]

Runtime Cost vs. Graph Stream Size (in seconds).
Method      0.4M    0.8M     1.2M      All
App-Ja      0.004   0.04     0.09      0.12
Ext-Ja      21.68   135.16   833.50    2748.40
App-CN      0.007   0.048    0.09      0.395
Ext-CN      18.0    194.31   801.72    2935.22
App-Adar    0.29    4.45     5.05      19.86
Ext-Adar    16.44   260.13   1193.77   3308.26

Conclusions. A new method for link prediction in graph streams that generalizes common neighborhood measures to the streaming setting. Future work will generalize more advanced techniques for graph streams and design methods for incorporating content in link prediction.