CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Similar documents
CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

How to organize the Web?

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Slides based on those in:

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

CS425: Algorithms for Web Scale Data

Introduction to Data Mining

Big Data Analytics CSCI 4030

Part 1: Link Analysis & Page Rank

Analysis of Large Graphs: TrustRank and WebSpam

Big Data Analytics CSCI 4030

Information Networks: PageRank

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Introduction to Data Mining

Lecture 8: Linkage algorithms and web search

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

COMP 4601 Hubs and Authorities

Link Analysis and Web Search

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing

Introduction to Information Retrieval

CS6220: DATA MINING TECHNIQUES

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

PageRank and related algorithms

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Information Retrieval and Web Search

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

Link Analysis in Web Mining

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

Information Retrieval. Lecture 11 - Link analysis

Bruno Martins. 1 st Semester 2012/2013

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Graph and Web Mining - Motivation, Applications and Algorithms PROF. EHUD GUDES DEPARTMENT OF COMPUTER SCIENCE BEN-GURION UNIVERSITY, ISRAEL

Link Analysis. Hongning Wang

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

Unit VIII. Chapter 9. Link Analysis

Brief (non-technical) history

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page

Link Analysis in the Cloud

Fast Random Walk with Restart: Algorithms and Applications U Kang Dept. of CSE Seoul National University

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

Information Retrieval and Web Search Engines

Web Structure Mining using Link Analysis Algorithms

Lecture 8: Linkage algorithms and web search

CS6200 Information Retreival. The WebGraph. July 13, 2015

Searching the Web [Arasu 01]

Mining Web Data. Lijun Zhang

COMP5331: Knowledge Discovery and Data Mining

CS 6604: Data Mining Large Networks and Time-Series

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

A brief history of Google

The PageRank Citation Ranking

CS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

CSI 445/660 Part 10 (Link Analysis and Web Search)

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

PV211: Introduction to Information Retrieval

Information Retrieval

Collaborative filtering based on a random walk model on a graph

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan

COMP Page Rank

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Unsupervised Learning. Pantelis P. Analytis. Introduction. Finding structure in graphs. Clustering analysis. Dimensionality reduction.

Mining Web Data. Lijun Zhang

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

University of Maryland. Tuesday, March 2, 2010

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

MAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis

CS535 Big Data Fall 2017 Colorado State University 9/5/2017. Week 3 - A. FAQs. This material is built based on,

Link Structure Analysis

Link Analysis SEEM5680. Taken from Introduction to Information Retrieval by C. Manning, P. Raghavan, and H. Schutze, Cambridge University Press.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

PageRank Algorithm Abstract: Keywords: I. Introduction II. Text Ranking Vs. Page Ranking

Network Centrality. Saptarshi Ghosh Department of CSE, IIT Kharagpur Social Computing course, CS60017

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

Lec 8: Adaptive Information Retrieval 2

Motivation. Motivation

Link analysis. Query-independent ordering. Query processing. Spamming simple popularity

Recent Researches on Web Page Ranking

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Graph Algorithms. Revised based on the slides by Ruoming Kent State

Social Network Analysis

PageRank. CS16: Introduction to Data Structures & Algorithms Spring 2018

Transcription:

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu

How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second try: Web Search Information Retrieval investigates: Find relevant docs in a small and trusted set Newspaper articles, Patents, etc. But: Web is huge, full of untrusted documents, random things, web spam, etc. So we need a good way to rank webpages! 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2

2 challenges of web search: (1) Web contains many sources of information Who to trust? Trick: Trustworthy pages may point to each other! (2) What is the best answer to query newspaper? No single right answer Trick: Pages that actually know about newspapers might all be pointing to many newspapers 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 3

All web pages are not equally important www.joe schmoe.com vs. www.stanford.edu We already know: There is large diversity in the web graph node connectivity. So, let s rank the pages using the web graph link structure! vs. 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 4

We will cover the following Link Analysis approaches to computing importances of nodes in a graph: Hubs and Authorities (HITS) Page Rank Topic Specific (Personalized) Page Rank Sidenote: Various notions of node centrality: Node Degree dentrality = degree of Betweenness centrality = #shortest paths passing through Closeness centrality = avg. length of shortest paths from to all other nodes of the network Eigenvector centrality = like PageRank 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 5

Goal (back to the newspaper example): Don t just find newspapers. Find experts pages that link in a coordinated way to good newspapers Idea: Links as votes Page is more important if it has more links In coming links? Out going links? Hubs and Authorities Each page has 2 scores: Quality as an expert (hub): Total sum of votes of pages pointed to Quality as an content (authority): Total sum of votes of experts Principle of repeated improvement NYT: 10 Ebay: 3 Yahoo: 3 CNN: 8 WSJ: 9 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 7

Interesting pages fall into two classes: 1. Authorities are pages containing useful information Newspaper home pages Course home pages Home pages of auto manufacturers 2. Hubs are pages that link to authorities List of newspapers Course bulletin List of U.S. auto manufacturers NYT: 10 Ebay: 3 Yahoo: 3 CNN: 8 WSJ: 9 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 8

Each page starts with hub score 1 Authorities collect their votes (Note this is idealized example. In reality graph is not bipartite and each page has both the hub and the authority score) 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 9

Hubs collect authority scores (Note this is idealized example. In reality graph is not bipartite and each page has both the hub and authority score) 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 10

Authorities collect hub scores (Note this is idealized example. In reality graph is not bipartite and each page has both the hub and authority score) 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 11

A good hub links to many good authorities A good authority is linked from many good hubs Model using two scores for each node: Hub score and Authority score Represented as vectors and 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 12

[Kleinberg 98] Each page has 2 scores: Authority score: Hub score: HITS algorithm: Initialize: Then keep iterating until convergence: Authority: Hub: Normalize:, Convergence criteria: 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 13

[Kleinberg 98] Details! Hits in the vector notation: Vector Adjacency matrix (n x n): if Can rewrite as So: And similarly: Repeat until convergence: Normalize and 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 14

Details! What is? Then: is updated (in 2 steps): h is updated (in 2 steps) new new Thus, in steps: Repeated matrix powering 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 15

Details! Definition: Eigenvectors & Eigenvalues Let for some scalar, vector, matrix Then is an eigenvector, and is its eigenvalue In our case the steady state is: So, authority is eigenvector of (associated with the largest eigenvalue) Similarly: hub is eigenvector of 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 16

Still the same idea: Links as votes Page is more important if it has more links In coming links? Out going links? Think of in links as votes: www.stanford.edu has 23,400 in links www.joe schmoe.com has 1 in link Are all in links are equal? Links from important pages count more Recursive question! 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 18

A vote from an important page is worth more: Each link s vote is proportional to the importance of its source page If page i with importance r i has d i out links, each link gets r i / d i votes Page j s own importance r j is the sum of the votes on its inlinks i j k r i /3 rk /4 r j /3 r j /3 r j /3 r j = r i /3 + r k /4 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 19

A page is important if it is pointed to by other important pages Define a rank r j for node j r j i j r i d i out-degree of node a/2 a The web in 1839 a/2 y/2 y y/2 m Flow equations: r y = r y /2 + r a /2 m You might wonder: Let s just use Gaussian elimination to solve this system of linear equations. Bad idea! r a = r y /2 + r m r m = r a /2 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 20

Stochastic adjacency matrix Let page has out links If, then is a column stochastic matrix Columns sum to 1 1/3 j i Rank vector : an entry per page is the importance score of page M The flow equations can be written r j i j r i d i 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 21

Imagine a random web surfer: At any time, surfer is on some page At time, the surfer follows an out link from uniformly at random Ends up on some page linked from Process repeats indefinitely Let: vector whose th coordinate is the prob. that the surfer is at page at time So, is a probability distribution over pages r j i 1 i 2 i 3 j i j d out ri (i) 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 22

Where is the surfer at time t+1? Follows a link uniformly at random Suppose the random walk reaches a state i 1 i 2 i 3 j p( t 1) M p( t) then is stationary distribution of a random walk Our original rank vector satisfies So, is a stationary distribution for the random walk 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 23

Given a web graph with n nodes, where the nodes are pages and edges are hyperlinks Assign each node an initial page rank Repeat until convergence ( i r i (t+1) r i (t) < ) Calculate the page rank of each node r ( t1) j i j r ( t) i d i. out-degree of node 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 25

Power Iteration: Set /N 1: 2: If : goto 1 Example: r y 1/3 1/3 5/12 9/24 6/15 r a = 1/3 3/6 1/3 11/24 6/15 r m 1/3 1/6 3/12 1/6 3/15 Iteration 0, 1, 2, a y m y a m y ½ ½ 0 a ½ 0 1 m 0 ½ 0 r y = r y /2 + r a /2 r a = r y /2 + r m r m = r a /2 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 26

Power Iteration: Set /N 1: 2: If : goto 1 Example: r y 1/3 1/3 5/12 9/24 6/15 r a = 1/3 3/6 1/3 11/24 6/15 r m 1/3 1/6 3/12 1/6 3/15 Iteration 0, 1, 2, a y m y a m y ½ ½ 0 a ½ 0 1 m 0 ½ 0 r y = r y /2 + r a /2 r a = r y /2 + r m r m = r a /2 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 27

r ( t1) j i j r ( t) i d i or equivalently r Mr Does this converge? Does it converge to what we want? Are results reasonable? 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 28

The Spider trap problem: a b r ( t1) j i j r ( t) i d i Example: r a 1 0 1 0 = r b 0 1 0 1 Iteration 0, 1, 2, 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 29

The Dead end problem: a b r ( t1) j i j r ( t) i d i Example: r a 1 0 0 0 = r b 0 1 0 0 Iteration 0, 1, 2, 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 30

2 problems: (1) Some pages are dead ends (have no out links) Such pages cause importance to leak out (2) Spider traps (all out links are within the group) Eventually spider traps absorb all importance 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 31

Power Iteration: Set And iterate a y m y a m y ½ ½ 0 a ½ 0 0 m 0 ½ 1 r y = r y /2 + r a /2 r a = r y /2 Example: r y 1/3 2/6 3/12 5/24 0 r a = 1/3 1/6 2/12 3/24 0 r m 1/3 3/6 7/12 16/24 1 Iteration 0, 1, 2, r m = r a /2 + r m 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 32

The Google solution for spider traps: At each time step, the random surfer has two options With prob., follow a link at random With prob. 1, jump to a random page Common values for are in the range 0.8 to 0.9 Surfer will teleport out of spider trap within a few time steps y y a m a m 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 33

Power Iteration: Set And iterate Example: r y 1/3 2/6 3/12 5/24 0 r a = 1/3 1/6 2/12 3/24 0 r m 1/3 1/6 1/12 2/24 0 Iteration 0, 1, 2, a y m y a m y ½ ½ 0 a ½ 0 0 m 0 ½ 0 r y = r y /2 + r a /2 r a = r y /2 r m = r a /2 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 34

Teleports: Follow random teleport links with probability 1.0 from dead ends Adjust matrix accordingly y y a m a m y a m y ½ ½ 0 a ½ 0 0 m 0 ½ 0 y a m y ½ ½ ⅓ a ½ 0 ⅓ m 0 ½ ⅓ 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 35

Google s solution: At each step, random surfer has two options: With probability, follow a link at random With probability 1-, jump to some random page PageRank equation [Brin Page, 98] The above formulation assumes that has no dead ends. We can either preprocess matrix (bad!) or explicitly follow random teleport links with probability 1.0 from dead-ends. See P. Berkhin, A Survey on PageRank Computing, Internet Mathematics, 2005. d i out degree of node i 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 36

Details! PageRank as a principal eigenvector But we really want: Let s define: or equivalently Now we get what we want: What is? In practice 0.15 (5 links and jump) d i out degree of node i Note: is a sparse matrix but is dense (all entries 0). In practice we never materialize but rather we use the sum formulation 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 37

Details! Input: Graph Directed graph Parameter and parameter Output: PageRank vector Set: do: : with spider traps and dead ends if in deg. of is 0 Now re insert the leaked PageRank: : while where: 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 38

11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 39

PageRank and HITS are two solutions to the same problem: What is the value of an in link from u to v? In the PageRank model, the value of the link depends on the links into u In the HITS model, it depends on the value of the other links out of u The destinies of PageRank and HITS post 1998 were very different 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 40

[Tong Faloutsos, 06] a.k.a.: Relevance, Closeness, Similarity 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 42

Shortest path is not good: No influence for degree 1 nodes (E, F, G)! Multi faceted relationships 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 43

Network Flow is not good: Does not punish long paths 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 44

[Tong Faloutsos, 06] Multiple Connections Quality of connection Direct & In direct connections Length, Degree, Weight 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 45

9 10 2 12 1 3 8 11 4 5 6 7 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 46

Goal: Evaluate pages not just by popularity but by how close they are to the topic Teleporting can go to: Any page with equal probability (we used this so far) A topic specific set of relevant pages Topic specific (personalized) PageRank if otherwise (S...teleport set) Random Walk with Restart: S is a single element 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 47

Graphs and web search: Ranks nodes by importance Personalized PageRank: Ranks proximity of nodes to the teleport nodes Proximity on graphs: Q: What is most related conference to ICDM? Random Walks with Restarts Teleport back to the starting node: S = { single node } IJCAI KDD ICDM SDM AAAI NIPS Conference Philip S. Yu Ning Zhong R. Ramakrishnan M. Jordan Author 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 48

Node 4 0.13 1 0.10 2 3 4 0.13 0.13 5 7 0.04 9 0.08 8 6 0.05 0.05 0.03 10 11 0.04 12 0.02 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Node 8 Node 9 Node 10 Node 11 Node 12 0.13 0.10 0.13 0.22 0.13 0.05 0.05 0.08 0.04 0.03 0.04 0.02 Nearby nodes, higher scores More red, more relevant Ranking vector r 4 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 49

I K Graph of CS conferences Q: Which conferences are closes to KDD & ICDM? A: Personalized PageRank with teleport set S={KDD, ICDM} 11/5/2013 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 50