The link prediction problem for social networks

Similar documents
Online Social Networks and Media

Link Prediction and Anomoly Detection

Topic mash II: assortativity, resilience, link prediction CS224W

Measuring scholarly impact: Methods and practice Link prediction with the linkpred tool

Link Prediction for Social Network

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks

Link Prediction Benchmarks

Using network evolution theory and singular value decomposition method to improve accuracy of link prediction in social networks

Department of Computer Science, Rensselaer Polytechnic Institute Troy, NY 12180

Generating Useful Network-based Features for Analyzing Social Networks

Mining Web Data. Lijun Zhang

Generating Useful Network-based Features for Analyzing Social Networks

Absorbing Random walks Coverage

Absorbing Random walks Coverage

CS 224W Final Report Group 37

Collaborative Filtering using Euclidean Distance in Recommendation Engine

Latent Distance Models and. Graph Feature Models Dr. Asja Fischer, Prof. Jens Lehmann

Extracting Information from Complex Networks

Structural Analysis of Paper Citation and Co-Authorship Networks using Network Analysis Techniques

Link Prediction in Fuzzy Social Networks using Distributed Learning Automata

Study of Data Mining Algorithm in Social Network Analysis

Transitive Node Similarity: Predicting and Recommending Links in Signed Social Networks

Chapter 9 Graph Algorithms

Chapter 9 Graph Algorithms

arxiv: v3 [cs.ni] 3 May 2017

An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization

Complex Networks. Structure and Dynamics

5.3 Angles and Their Measure

A Naïve Bayes model based on overlapping groups for link prediction in online social networks

Introduction to Complex Networks Analysis

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Centrality in Large Networks

On the impact of small-world on local search

Link Prediction and Topological Feature Importance in Social Networks

Social Information Filtering

Information Networks: PageRank

How to organize the Web?

Graph Model Selection using Maximum Likelihood

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Graph Mining and Social Network Analysis

Social & Information Network Analysis CS 224W

Mining Web Data. Lijun Zhang

Digital Urban Simulation

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Link Prediction in a Modified Heterogeneous Bibliographic Network

Matching Algorithms. Proof. If a bipartite graph has a perfect matching, then it is easy to see that the right hand side is a necessary condition.

How to explore big networks? Question: Perform a random walk on G. What is the average node degree among visited nodes, if avg degree in G is 200?

Example for calculation of clustering coefficient Node N 1 has 8 neighbors (red arrows) There are 12 connectivities among neighbors (blue arrows)

15 Graph Theory Counting the Number of Relations. 2. onto (surjective).

Supervised Link Prediction with Path Scores

Bipartite Editing Prediction in Wikipedia

Guanxi in the Chinese Web. University of Victoria Victoria, BC, Canada {yul, val,

Phd. studies at Mälardalen University

Fast Nearest Neighbor Search on Large Time-Evolving Graphs

Summary: What We Have Learned So Far

Personalized Information Retrieval

Link prediction in multiplex bibliographical networks

Math 778S Spectral Graph Theory Handout #2: Basic graph theory

CAIM: Cerca i Anàlisi d Informació Massiva

Overlay (and P2P) Networks

2013/2/12 EVOLVING GRAPH. Bahman Bahmani(Stanford) Ravi Kumar(Google) Mohammad Mahdian(Google) Eli Upfal(Brown) Yanzhao Yang

Problem 1.R1: How Many Bits?

Structure of Social Networks

Volume 2, Issue 11, November 2014 International Journal of Advance Research in Computer Science and Management Studies

Network Theory: Social, Mythological and Fictional Networks. Math 485, Spring 2018 (Midterm Report) Christina West, Taylor Martins, Yihe Hao

Who should catch (this Exception) {?}

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

PageRank and related algorithms

Degree Distribution: The case of Citation Networks

SOFIA: Social Filtering for Niche Markets

Sampling Large Graphs: Algorithms and Applications

Maximal Monochromatic Geodesics in an Antipodal Coloring of Hypercube

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Small-World Models and Network Growth Models. Anastassia Semjonova Roman Tekhov

arxiv: v1 [cs.si] 8 Apr 2016

Diffusion and Clustering on Large Graphs

Collaborative filtering based on a random walk model on a graph

Social Interaction Based Video Recommendation: Recommending YouTube Videos to Facebook Users

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Supplementary material to Epidemic spreading on complex networks with community structures

A Survey of Link Recommendation for Social Networks: Methods, Theoretical Foundations, and Future Research Directions

Clustering: Classic Methods and Modern Views

5/13/2009. Introduction. Introduction. Introduction. Introduction. Introduction

Lecture 3: Descriptive Statistics

CS-E5740. Complex Networks. Network analysis: key measures and characteristics

Supervised Random Walks on Homogeneous Projections of Heterogeneous Networks

Example 1: An algorithmic view of the small world phenomenon

Link prediction with sequences on an online social network

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

Diffusion and Clustering on Large Graphs

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena

Constructing a G(N, p) Network

Link Prediction in Heterogeneous Collaboration. networks

Matching Theory. Figure 1: Is this graph bipartite?

Graph Theory. Network Science: Graph theory. Graph theory Terminology and notation. Graph theory Graph visualization

Graph Sampling Approach for Reducing. Computational Complexity of. Large-Scale Social Network

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

The Directed Closure Process in Hybrid Social-Information Networks, with an Analysis of Link Formation on Twitter

TEST 3 REVIEW DAVID BEN MCREYNOLDS

Sergei Silvestrov, Christopher Engström. January 29, 2013

Transcription:

The link prediction problem for social networks Alexandra Chouldechova STATS 319, February 1, 2011

Motivation Recommending new friends in in online social networks. Suggesting interactions between the members of a company/organization that are external to the hierarchical structure of the organization itself. Predicting connections between members of terrorist organizations who have not been directly observed to work together. Suggesting collaborations between researchers based on co-authorship.

Statement of the problem Link prediction problem: Given the links in a social network at time t or during a time interval I, we wish to predict the links that will be added to the network during the later time interval from time t to a some given future time.

Statement of the problem Link prediction problem: Given the links in a social network at time t or during a time interval I, we wish to predict the links that will be added to the network during the later time interval from time t to a some given future time. Main approach: Use measures of network-proximity adapted from graph theory, computer science, and the social sciences to determine which unconnected nodes are close together in the topology of the network.

General Notation Social network G = V, E (or G = A, E if nodes are authors) An edge e = u, v E represents an interaction between u and v that took place at time t(e)

General Notation Social network G = V, E (or G = A, E if nodes are authors) An edge e = u, v E represents an interaction between u and v that took place at time t(e) For t < t, let G[t, t ] denote the subgraph of G consisting of all edges that took place between t and t

General Notation Social network G = V, E (or G = A, E if nodes are authors) An edge e = u, v E represents an interaction between u and v that took place at time t(e) For t < t, let G[t, t ] denote the subgraph of G consisting of all edges that took place between t and t For t 0 < t 0 < t 1 < t 1, given G[t 0, t 0 ], we wish to output a list of edges not in G[t 0, t 0 ] that are predicted to appear in G[t 1, t 1 ]

General Notation Social network G = V, E (or G = A, E if nodes are authors) An edge e = u, v E represents an interaction between u and v that took place at time t(e) For t < t, let G[t, t ] denote the subgraph of G consisting of all edges that took place between t and t For t 0 < t 0 < t 1 < t 1, given G[t 0, t 0 ], we wish to output a list of edges not in G[t 0, t 0 ] that are predicted to appear in G[t 1, t 1 ] Let Core V denote the set of all nodes that are incident to at least κ training edges in G[t 0, t 0 ] and at least κ test edges in G[t 1, t 1 ]

Notation for arxiv physics co-authorship network [t 0, t 0 ] are the three years 1993 1996 [t 1, t 1 ] are the three years 1997 1999

Notation for arxiv physics co-authorship network [t 0, t 0 ] are the three years 1993 1996 [t 1, t 1 ] are the three years 1997 1999 G[1993, 1996] = G collab = A, E old

Notation for arxiv physics co-authorship network [t 0, t 0 ] are the three years 1993 1996 [t 1, t 1 ] are the three years 1997 1999 G[1993, 1996] = G collab = A, E old E new is the set of edges u, v such that authors u and v co-authored an article sometime during 1997 1999 but not during 1993 1996

Notation for arxiv physics co-authorship network [t 0, t 0 ] are the three years 1993 1996 [t 1, t 1 ] are the three years 1997 1999 G[1993, 1996] = G collab = A, E old E new is the set of edges u, v such that authors u and v co-authored an article sometime during 1997 1999 but not during 1993 1996 Each link predictor p outputs a ranked list L p of pairs in A A E old. List is ordered according to decreasing values of score(x, y) for x, y A A E old

Notation for arxiv physics co-authorship network [t 0, t 0 ] are the three years 1993 1996 [t 1, t 1 ] are the three years 1997 1999 G[1993, 1996] = G collab = A, E old E new is the set of edges u, v such that authors u and v co-authored an article sometime during 1997 1999 but not during 1993 1996 Each link predictor p outputs a ranked list L p of pairs in A A E old. List is ordered according to decreasing values of score(x, y) for x, y A A E old E new = E new (Core Core), n = E new

Notation for arxiv physics co-authorship network [t 0, t 0 ] are the three years 1993 1996 [t 1, t 1 ] are the three years 1997 1999 G[1993, 1996] = G collab = A, E old E new is the set of edges u, v such that authors u and v co-authored an article sometime during 1997 1999 but not during 1993 1996 Each link predictor p outputs a ranked list L p of pairs in A A E old. List is ordered according to decreasing values of score(x, y) for x, y A A E old E new = E new (Core Core), n = E new Evaluate method by taking the top n edge predictions from L p that are in Core Core and computing the size of the intersection with E new

Methods for Link Prediction: Shortest-path Shortest-path: For x, y A A E old, define, score(x, y) = (negated) length of shortest path between x and y If there are more than n pairs of nodes tied for the shortest path length, order them at random.

Methods for Link Prediction: Neighbourhood-based Let Γ(x) denote the set of neighbours of x in G collab Common neighbours: Based on the idea that links are formed between nodes who share many common neighbours score(x, y) = Γ(x) Γ(y)

Methods for Link Prediction: Neighbourhood-based Let Γ(x) denote the set of neighbours of x in G collab Common neighbours: Based on the idea that links are formed between nodes who share many common neighbours score(x, y) = Γ(x) Γ(y) Jaccard s coefficient: Measure how likely a neighbour of x is to be a neighbour of y and vice versa score(x, y) = Γ(x) Γ(y) Γ(x) Γ(y)

Methods for Link Prediction: Neighbourhood-based Let Γ(x) denote the set of neighbours of x in G collab Common neighbours: Based on the idea that links are formed between nodes who share many common neighbours score(x, y) = Γ(x) Γ(y) Jaccard s coefficient: Measure how likely a neighbour of x is to be a neighbour of y and vice versa score(x, y) = Γ(x) Γ(y) Γ(x) Γ(y) Adamic/Adar: Assigns large weight to common neighbours z of x and y which themselves have few neighbours Γ(z) score(x, y) = z Γ(x) Γ(y) 1 log Γ(z)

Methods for Link Prediction: Preferential attachment Based on the premise that a new edge has node x as its endpoint is proportional to Γ(x). i.e., nodes like to form ties with popular nodes Preferential attachment: Researchers found empirical evidence to suggest that co-authorship is correlated with the product of the neighbourhood sizes score(x, y) = Γ(x) Γ(y)

Methods for Link Prediction: Ensembles of all Paths Katz β measure: Sums over all possible paths between x and y, giving higher weight to shorter paths. score(x, y) = l=1 β l paths (l) x,y where β > 0 and paths (l) x,y is the set of all length-l paths from x to y.

Methods for Link Prediction: Ensembles of all Paths Katz β measure: Sums over all possible paths between x and y, giving higher weight to shorter paths. score(x, y) = l=1 β l paths (l) x,y where β > 0 and paths (l) x,y is the set of all length-l paths from x to y. Two variants of the Katz measure are considered (a) unweighted: paths (l) x,y = 1 if x and y have collaborated and 0 otherwise (b) weighted: paths (l) x,y is the number of times that x and y have collaborated.

Methods for Link Prediction: Hitting and Commute times Consider a random walk on G collab which starts at x and iteratively moves to a neighbour of x chosen uniformly at random from Γ(x). The Hitting Time H x,y from x to y is the expected number of steps it takes for the RW starting at x to reach y. score(x, y) = H x,y

Methods for Link Prediction: Hitting and Commute times Consider a random walk on G collab which starts at x and iteratively moves to a neighbour of x chosen uniformly at random from Γ(x). The Hitting Time H x,y from x to y is the expected number of steps it takes for the RW starting at x to reach y. score(x, y) = H x,y The Commute Time C x,y = H x,y + H y,x is the expected number of steps to travel from x to y then back to x. score(x, y) = C x,y = (H x,y + H y,x )

Methods for Link Prediction: Hitting and Commute times Consider a random walk on G collab which starts at x and iteratively moves to a neighbour of x chosen uniformly at random from Γ(x). The Hitting Time H x,y from x to y is the expected number of steps it takes for the RW starting at x to reach y. score(x, y) = H x,y The Commute Time C x,y = H x,y + H y,x is the expected number of steps to travel from x to y then back to x. score(x, y) = C x,y = (H x,y + H y,x ) Can also consider stationary-normed versions: score(x, y) = H x,y π y score(x, y) = (H x,y π y + H y,x π x )

Methods for Link Prediction: Rooted PageRank The hitting time and commute time measures are sensitive to parts of the graph far away from x and y

Methods for Link Prediction: Rooted PageRank The hitting time and commute time measures are sensitive to parts of the graph far away from x and y Consider instead the random walk on G collab that starts at x that has a probability of α of returning to x at each step. Rooted PageRank: score(x, y) = stationary distribution weight of y under this scheme

Methods for Link Prediction: SimRank SimRank γ : Let similarity(x, y) be a fixed point of a Γ(x) b Γ(y) similarity(a, b) similarity(x, y) = γ Γ(x) Γ(y) where γ [0, 1] score(x, y) = similarity(x, y) This is the expected value of γ l under the random walk probabilities, where l is the time at which random walks started from x and y first meet

Comparisons of the different methods Prediction accuracy will be tabulated in terms of relative improvement over a random predictor

Comparisons of the different methods Prediction accuracy will be tabulated in terms of relative improvement over a random predictor The random predictor simply predicts randomly selected pairs of authors from Core who did not collaborate during the training interval 1993 1996.

Comparisons of the different methods Prediction accuracy will be tabulated in terms of relative improvement over a random predictor The random predictor simply predicts randomly selected pairs of authors from Core who did not collaborate during the training interval 1993 1996. The probability the random prediction is correct is ( Core 2 1 ) Eold This value ranges from 0.15% in cond-mat to 0.48% in astro-ph

References Liben-Nowell, D. and Kleinberg, J. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7) 1019 1031 (2007)