CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu
How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second try: Web Search Information Retrieval investigates: Find relevant docs in a small and trusted set Newspaper articles, Patents, etc. But: Web is huge, full of untrusted documents, random things, web spam, etc. 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2
Does more documents mean better results? Billions of documents Search engine index size over time 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 3
2 challenges of web search: (1) Web contains many sources of information Who to trust? Trick: Trustworthy pages may point to each other! (2) What is the best answer to query newspaper? No single right answer Trick: Pages that actually know about newspapers might all be pointing to many newspapers 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 4
All web pages are not equally important www.joe schmoe.com vs. www.stanford.edu We already know: There is large diversity in the web graph node connectivity. Let s rank the pages by the link structure! vs. 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 5
We will cover the following Link Analysis approaches to computing importances of nodes in a graph: Hubs and Authorities (HITS) Page Rank Topic Specific (Personalized) Page Rank Sidenote: Various notions of node centrality: Node Degree dentrality = degree of Betweenness centrality = #shortest paths passing through Closeness centrality = avg. length of shortest paths from to all other nodes of the network Eigenvector centrality = like PageRank 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 6
Goal (back to the newspaper example): Don t just find newspapers. Find experts pages that link in a coordinated way to good newspapers Idea: Links as votes Page is more important if it has more links In coming links? Out going links? Hubs and Authorities Each page has 2 scores: Quality as an expert (hub): Total sum of votes of pages pointed to Quality as an content (authority): Total sum of votes of experts Principle of repeated improvement 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 7 NYT: 10 Ebay: 3 Yahoo: 3 CNN: 8 WSJ: 9
Interesting pages fall into two classes: 1. Authorities are pages containing useful information Newspaper home pages Course home pages Home pages of auto manufacturers 2. Hubs are pages that link to authorities List of newspapers Course bulletin List of US auto manufacturers NYT: 10 Ebay: 3 Yahoo: 3 CNN: 8 WSJ: 9 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 8
Each page starts with hub score 1 Authorities collect their votes (Note this is idealized example. In reality graph is not bipartite and each page has both the hub and authority score) 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 9
Hubs collect authority scores (Note this is idealized example. In reality graph is not bipartite and each page has both the hub and authority score) 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 10
Authorities collect hub scores (Note this is idealized example. In reality graph is not bipartite and each page has both the hub and authority score) 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 11
A good hub links to many good authorities A good authority is linked from many good hubs Model using two scores for each node: Hub score and Authority score Represented as vectors and 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 12
[Kleinberg 98] Each page has 2 scores: j 1 j 2 j 3 j 4 Authority score: Hub score: HITS algorithm: Initialize: Then keep iterating until convergence: i i Authority: Hub: Normalize:, j 1 j 2 j 3 j 4 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 13
[Kleinberg 98] HITS converges to a single stable point Notation: Vector 1 1 Adjacency matrix (n x n): if Then can be rewriten as So: And likewise: 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 14
HITS algorithm in vector notation: Set: Repeat until convergence: Normalize Then: Thus, in and steps: new new Convergence criterion: is updated (in 2 steps): h is updated (in 2 steps): Repeated matrix powering 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 15
Definition: Let for some scalar, vector, matrix Then Fact: is an eigenvector, and is its eigenvalue If is symmetric ( ) (in our case and are symmetric) Then has orthogonal unit eigenvectors 1 that form a basis (coordinate system) with eigenvalues 1 ( ) 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 16
Let s write in coordinate system 1 has coordinates ( 1 ) Suppose: 1 ( 1 ) Then: As, if we normalize (contribution of all other coordinates 0) So, authority is eigenvector of lim associated with largest eigenvalue 1 Similarly: hub is eigenvector of 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 17
Still the same idea: Links as votes Page is more important if it has more links In coming links? Out going links? Think of in links as votes: www.stanford.edu has 23,400 in links www.joe schmoe.com has 1 in link Are all in links are equal? Links from important pages count more Recursive question! 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 19
A vote from an important page is worth more A page is important if it is pointed to by other important pages Define a rank r j for node j r j i j r i d i out-degree of node a/2 a The web in 1839 a/2 y/2 y y/2 m Flow equations: r y = r y /2 + r a /2 r a = r y /2 + r m r m = r a /2 m 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 20
Stochastic adjacency matrix Let page has out links else is a column stochastic matrix Columns sum to 1 If, then Rank vector : vector with an entry per page is the importance score of page The flow equations can be written r j r i i j d i 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 21
Imagine a random web surfer: At any time, surfer is on some page At time, the surfer follows an out link from uniformly at random Ends up on some page linked from Process repeats indefinitely Let: vector whose th coordinate is the prob. that the surfer is at page at time So, is a probability distribution over pages r j i 1 i 2 i 3 j i j d out ri (i) 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 22
Where is the surfer at time t+1? Follows a link uniformly at random Suppose the random walk reaches a state i 1 i 2 i 3 j p( t 1) M p( t) then is stationary distribution of a random walk Our original rank vector satisfies So, is a stationary distribution for the random walk 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 23
Given a web graph with n nodes, where the nodes are pages and edges are hyperlinks Assign each node an initial page rank Repeat until convergence ( i r i (t+1) r i (t) < ) Calculate the page rank of each node r ( t1) j i j r ( t) i d i. out-degree of node 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 24
Power Iteration: Set And iterate Example: r y 1/3 1/3 5/12 9/24 6/15 r a = 1/3 3/6 1/3 11/24 6/15 r m 1/3 1/6 3/12 1/6 3/15 Iteration 0, 1, 2, a y m y a m y ½ ½ 0 a ½ 0 1 m 0 ½ 0 r y = r y /2 + r a /2 r a = r y /2 + r m r m = r a /2 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 25
r ( t1) j i j r ( t) i d i or equivalently r Mr Does this converge? Does it converge to what we want? Are results reasonable? 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 26
a b r ( t1) j i j r ( t) i d i Example: r a 1 0 1 0 = r b 0 1 0 1 Iteration 0, 1, 2, 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 27
a b r ( t1) j i j r ( t) i d i Example: r a 1 0 0 0 = r b 0 1 0 0 Iteration 0, 1, 2, 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 28
2 problems: (1) Some pages are dead ends (have no out links) Such pages cause importance to leak out (2) Spider traps (all out links are within the group) Eventually spider traps absorb all importance 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 29
Power Iteration: y y a m y ½ ½ 0 Set And iterate a m a ½ 0 0 m 0 ½ 1 r y = r y /2 + r a /2 r a = r y /2 Example: r y 1/3 2/6 3/12 5/24 0 r a = 1/3 1/6 2/12 3/24 0 r m 1/3 3/6 7/12 16/24 1 Iteration 0, 1, 2, r m = r a /2 + r m 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 30
The Google solution for spider traps: At each time step, the random surfer has two options With prob., follow a link at random With prob. 1, jump to some page uniformly at random Common values for are in the range 0.8 to 0.9 Surfer will teleport out of spider trap within a few time steps y y a m a m 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 31
Power Iteration: Set And iterate Example: r y 1/3 2/6 3/12 5/24 0 r a = 1/3 1/6 2/12 3/24 0 r m 1/3 1/6 1/12 2/24 0 Iteration 0, 1, 2, a y m y a m y ½ ½ 0 a ½ 0 0 m 0 ½ 0 r y = r y /2 + r a /2 r a = r y /2 r m = r a /2 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 32
Teleports: Follow random teleport links with probability 1.0 from dead ends Adjust matrix accordingly y y a m a m y a m y ½ ½ 0 a ½ 0 0 m 0 ½ 0 y a m y ½ ½ ⅓ a ½ 0 ⅓ m 0 ½ ⅓ 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 33
Google s solution: At each step, random surfer has two options: With probability 1-, follow a link at random With probability, jump to some random page PageRank equation [Brin Page, 98] The above formulation assumes that has no dead ends. We can either preprocess matrix (bad!) or explicitly follow random teleport links with probability 1.0 from dead-ends. See P. Berkhin, A Survey on PageRank Computing, Internet Mathematics, 2005. d i out degree of node i 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 34
PageRank as a principal eigenvector or equivalently But we really want: Let s define: d i out degree of node i Now we get what we want: What is? In practice 0.15 (5 links and jump) Note: is a sparse matrix but is dense (all entries 0). In practice we never materialize but rather we use the sum formulation 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 35
Input: and Adjacency matrix of a directed graph with spider traps and dead ends Parameter Output: PageRank vector Set: Repeat until: Now re insert the leaked PageRank: See P. Berkhin, A Survey on PageRank Computing, Internet Mathematics, 2005., if in deg. of is 0 then Where: 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 36
11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 37
PageRank and HITS are two solutions to the same problem: What is the value of an in link from u to v? In the PageRank model, the value of the link depends on the links into u In the HITS model, it depends on the value of the other links out of u The destinies of PageRank and HITS post 1998 were very different 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 38
Goal: Evaluate pages not just by popularity but by how close they are to the topic Teleporting can go to: Any page with equal probability (we used this so far) A topic specific set of relevant pages Topic specific (personalized) PageRank (S...teleport set) if otherwise Useful for measuring proximity of other nodes to 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 40
Graphs and web search: Ranks nodes by importance Personalized PageRank: Ranks proximity of nodes to the teleport nodes Proximity on graphs: Q: What is most related conference to ICDM? Random Walks with Restarts Teleport back to the starting node: S = { single node } IJCAI KDD ICDM SDM AAAI NIPS Conference Philip S. Yu Ning Zhong R. Ramakrishnan M. Jordan Author 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 41
Link Farms: networks of millions of pages design to focus PageRank on a few undeserving webpages To minimize their influence use a teleport set of trusted webpages E.g., homepages of universities 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 42
11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 44
Rich get richer [Cho et al., WWW 04] Two snapshots of the web graph at two different time points Measure the change: In the number of in links PageRank http://oak.cs.ucla.edu/~cho/papers/cho-bias.pdf 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 45