CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Similar documents
CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

How to organize the Web?

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Slides based on those in:

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

Big Data Analytics CSCI 4030

CS425: Algorithms for Web Scale Data

Big Data Analytics CSCI 4030

Part 1: Link Analysis & Page Rank

Introduction to Data Mining

Analysis of Large Graphs: TrustRank and WebSpam

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Information Networks: PageRank

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Introduction to Data Mining

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

Lecture 8: Linkage algorithms and web search

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

Link Analysis and Web Search

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing

PageRank and related algorithms

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

COMP 4601 Hubs and Authorities

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

CS6200 Information Retreival. The WebGraph. July 13, 2015

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

COMP5331: Knowledge Discovery and Data Mining

Introduction to Information Retrieval

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

CS6220: DATA MINING TECHNIQUES

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

Unit VIII. Chapter 9. Link Analysis

Information Retrieval. Lecture 11 - Link analysis

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Graph and Web Mining - Motivation, Applications and Algorithms PROF. EHUD GUDES DEPARTMENT OF COMPUTER SCIENCE BEN-GURION UNIVERSITY, ISRAEL

Link Analysis in Web Mining

Brief (non-technical) history

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

Bruno Martins. 1 st Semester 2012/2013

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

Link Analysis in the Cloud

Searching the Web [Arasu 01]

Information Retrieval and Web Search

Lecture 8: Linkage algorithms and web search

Information Retrieval and Web Search Engines

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Motivation. Motivation

Fast Random Walk with Restart: Algorithms and Applications U Kang Dept. of CSE Seoul National University

Web Structure Mining using Link Analysis Algorithms

CS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page

Link Analysis. Hongning Wang

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

Mining Web Data. Lijun Zhang

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

Administrative. Web crawlers. Web Crawlers and Link Analysis!

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

Jeffrey D. Ullman Stanford University

A brief history of Google

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

The PageRank Citation Ranking

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

CS 6604: Data Mining Large Networks and Time-Series

CSI 445/660 Part 10 (Link Analysis and Web Search)

CS535 Big Data Fall 2017 Colorado State University 9/5/2017. Week 3 - A. FAQs. This material is built based on,

Network Centrality. Saptarshi Ghosh Department of CSE, IIT Kharagpur Social Computing course, CS60017

Collaborative filtering based on a random walk model on a graph

Extracting Information from Complex Networks

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Link Analysis SEEM5680. Taken from Introduction to Information Retrieval by C. Manning, P. Raghavan, and H. Schutze, Cambridge University Press.

Unsupervised Learning. Pantelis P. Analytis. Introduction. Finding structure in graphs. Clustering analysis. Dimensionality reduction.

MAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds

Lec 8: Adaptive Information Retrieval 2

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

Popularity of Twitter Accounts: PageRank on a Social Network

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Supervised Random Walks

Recent Researches on Web Page Ranking

Link Structure Analysis

PV211: Introduction to Information Retrieval

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan

Social Network Analysis

Information Retrieval

Bibliometrics: Citation Analysis

Temporal Graphs KRISHNAN PANAMALAI MURALI

Lecture 17 November 7

COMP Page Rank

PageRank Algorithm Abstract: Keywords: I. Introduction II. Text Ranking Vs. Page Ranking

CS435 Introduction to Big Data FALL 2018 Colorado State University. 9/24/2018 Week 6-A Sangmi Lee Pallickara. Topics. This material is built based on,

Mining Web Data. Lijun Zhang

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Learning to Rank Networked Entities

Transcription:

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu

How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second try: Web Search Information Retrieval investigates: Find relevant docs in a small and trusted set Newspaper articles, Patents, etc. But: Web is huge, full of untrusted documents, random things, web spam, etc. 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2

Does more documents mean better results? Billions of documents Search engine index size over time 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 3

2 challenges of web search: (1) Web contains many sources of information Who to trust? Trick: Trustworthy pages may point to each other! (2) What is the best answer to query newspaper? No single right answer Trick: Pages that actually know about newspapers might all be pointing to many newspapers 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 4

All web pages are not equally important www.joe schmoe.com vs. www.stanford.edu We already know: There is large diversity in the web graph node connectivity. Let s rank the pages by the link structure! vs. 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 5

We will cover the following Link Analysis approaches to computing importances of nodes in a graph: Hubs and Authorities (HITS) Page Rank Topic Specific (Personalized) Page Rank Sidenote: Various notions of node centrality: Node Degree dentrality = degree of Betweenness centrality = #shortest paths passing through Closeness centrality = avg. length of shortest paths from to all other nodes of the network Eigenvector centrality = like PageRank 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 6

Goal (back to the newspaper example): Don t just find newspapers. Find experts pages that link in a coordinated way to good newspapers Idea: Links as votes Page is more important if it has more links In coming links? Out going links? Hubs and Authorities Each page has 2 scores: Quality as an expert (hub): Total sum of votes of pages pointed to Quality as an content (authority): Total sum of votes of experts Principle of repeated improvement 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 7 NYT: 10 Ebay: 3 Yahoo: 3 CNN: 8 WSJ: 9

Interesting pages fall into two classes: 1. Authorities are pages containing useful information Newspaper home pages Course home pages Home pages of auto manufacturers 2. Hubs are pages that link to authorities List of newspapers Course bulletin List of US auto manufacturers NYT: 10 Ebay: 3 Yahoo: 3 CNN: 8 WSJ: 9 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 8

Each page starts with hub score 1 Authorities collect their votes (Note this is idealized example. In reality graph is not bipartite and each page has both the hub and authority score) 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 9

Hubs collect authority scores (Note this is idealized example. In reality graph is not bipartite and each page has both the hub and authority score) 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 10

Authorities collect hub scores (Note this is idealized example. In reality graph is not bipartite and each page has both the hub and authority score) 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 11

A good hub links to many good authorities A good authority is linked from many good hubs Model using two scores for each node: Hub score and Authority score Represented as vectors and 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 12

[Kleinberg 98] Each page has 2 scores: j 1 j 2 j 3 j 4 Authority score: Hub score: HITS algorithm: Initialize: Then keep iterating until convergence: i i Authority: Hub: Normalize:, j 1 j 2 j 3 j 4 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 13

[Kleinberg 98] HITS converges to a single stable point Notation: Vector 1 1 Adjacency matrix (n x n): if Then can be rewriten as So: And likewise: 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 14

HITS algorithm in vector notation: Set: Repeat until convergence: Normalize Then: Thus, in and steps: new new Convergence criterion: is updated (in 2 steps): h is updated (in 2 steps): Repeated matrix powering 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 15

Definition: Let for some scalar, vector, matrix Then Fact: is an eigenvector, and is its eigenvalue If is symmetric ( ) (in our case and are symmetric) Then has orthogonal unit eigenvectors 1 that form a basis (coordinate system) with eigenvalues 1 ( ) 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 16

Let s write in coordinate system 1 has coordinates ( 1 ) Suppose: 1 ( 1 ) Then: As, if we normalize (contribution of all other coordinates 0) So, authority is eigenvector of lim associated with largest eigenvalue 1 Similarly: hub is eigenvector of 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 17

Still the same idea: Links as votes Page is more important if it has more links In coming links? Out going links? Think of in links as votes: www.stanford.edu has 23,400 in links www.joe schmoe.com has 1 in link Are all in links are equal? Links from important pages count more Recursive question! 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 19

A vote from an important page is worth more A page is important if it is pointed to by other important pages Define a rank r j for node j r j i j r i d i out-degree of node a/2 a The web in 1839 a/2 y/2 y y/2 m Flow equations: r y = r y /2 + r a /2 r a = r y /2 + r m r m = r a /2 m 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 20

Stochastic adjacency matrix Let page has out links else is a column stochastic matrix Columns sum to 1 If, then Rank vector : vector with an entry per page is the importance score of page The flow equations can be written r j r i i j d i 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 21

Imagine a random web surfer: At any time, surfer is on some page At time, the surfer follows an out link from uniformly at random Ends up on some page linked from Process repeats indefinitely Let: vector whose th coordinate is the prob. that the surfer is at page at time So, is a probability distribution over pages r j i 1 i 2 i 3 j i j d out ri (i) 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 22

Where is the surfer at time t+1? Follows a link uniformly at random Suppose the random walk reaches a state i 1 i 2 i 3 j p( t 1) M p( t) then is stationary distribution of a random walk Our original rank vector satisfies So, is a stationary distribution for the random walk 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 23

Given a web graph with n nodes, where the nodes are pages and edges are hyperlinks Assign each node an initial page rank Repeat until convergence ( i r i (t+1) r i (t) < ) Calculate the page rank of each node r ( t1) j i j r ( t) i d i. out-degree of node 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 24

Power Iteration: Set And iterate Example: r y 1/3 1/3 5/12 9/24 6/15 r a = 1/3 3/6 1/3 11/24 6/15 r m 1/3 1/6 3/12 1/6 3/15 Iteration 0, 1, 2, a y m y a m y ½ ½ 0 a ½ 0 1 m 0 ½ 0 r y = r y /2 + r a /2 r a = r y /2 + r m r m = r a /2 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 25

r ( t1) j i j r ( t) i d i or equivalently r Mr Does this converge? Does it converge to what we want? Are results reasonable? 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 26

a b r ( t1) j i j r ( t) i d i Example: r a 1 0 1 0 = r b 0 1 0 1 Iteration 0, 1, 2, 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 27

a b r ( t1) j i j r ( t) i d i Example: r a 1 0 0 0 = r b 0 1 0 0 Iteration 0, 1, 2, 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 28

2 problems: (1) Some pages are dead ends (have no out links) Such pages cause importance to leak out (2) Spider traps (all out links are within the group) Eventually spider traps absorb all importance 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 29

Power Iteration: y y a m y ½ ½ 0 Set And iterate a m a ½ 0 0 m 0 ½ 1 r y = r y /2 + r a /2 r a = r y /2 Example: r y 1/3 2/6 3/12 5/24 0 r a = 1/3 1/6 2/12 3/24 0 r m 1/3 3/6 7/12 16/24 1 Iteration 0, 1, 2, r m = r a /2 + r m 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 30

The Google solution for spider traps: At each time step, the random surfer has two options With prob., follow a link at random With prob. 1, jump to some page uniformly at random Common values for are in the range 0.8 to 0.9 Surfer will teleport out of spider trap within a few time steps y y a m a m 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 31

Power Iteration: Set And iterate Example: r y 1/3 2/6 3/12 5/24 0 r a = 1/3 1/6 2/12 3/24 0 r m 1/3 1/6 1/12 2/24 0 Iteration 0, 1, 2, a y m y a m y ½ ½ 0 a ½ 0 0 m 0 ½ 0 r y = r y /2 + r a /2 r a = r y /2 r m = r a /2 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 32

Teleports: Follow random teleport links with probability 1.0 from dead ends Adjust matrix accordingly y y a m a m y a m y ½ ½ 0 a ½ 0 0 m 0 ½ 0 y a m y ½ ½ ⅓ a ½ 0 ⅓ m 0 ½ ⅓ 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 33

Google s solution: At each step, random surfer has two options: With probability 1-, follow a link at random With probability, jump to some random page PageRank equation [Brin Page, 98] The above formulation assumes that has no dead ends. We can either preprocess matrix (bad!) or explicitly follow random teleport links with probability 1.0 from dead-ends. See P. Berkhin, A Survey on PageRank Computing, Internet Mathematics, 2005. d i out degree of node i 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 34

PageRank as a principal eigenvector or equivalently But we really want: Let s define: d i out degree of node i Now we get what we want: What is? In practice 0.15 (5 links and jump) Note: is a sparse matrix but is dense (all entries 0). In practice we never materialize but rather we use the sum formulation 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 35

Input: and Adjacency matrix of a directed graph with spider traps and dead ends Parameter Output: PageRank vector Set: Repeat until: Now re insert the leaked PageRank: See P. Berkhin, A Survey on PageRank Computing, Internet Mathematics, 2005., if in deg. of is 0 then Where: 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 36

11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 37

PageRank and HITS are two solutions to the same problem: What is the value of an in link from u to v? In the PageRank model, the value of the link depends on the links into u In the HITS model, it depends on the value of the other links out of u The destinies of PageRank and HITS post 1998 were very different 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 38

Goal: Evaluate pages not just by popularity but by how close they are to the topic Teleporting can go to: Any page with equal probability (we used this so far) A topic specific set of relevant pages Topic specific (personalized) PageRank (S...teleport set) if otherwise Useful for measuring proximity of other nodes to 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 40

Graphs and web search: Ranks nodes by importance Personalized PageRank: Ranks proximity of nodes to the teleport nodes Proximity on graphs: Q: What is most related conference to ICDM? Random Walks with Restarts Teleport back to the starting node: S = { single node } IJCAI KDD ICDM SDM AAAI NIPS Conference Philip S. Yu Ning Zhong R. Ramakrishnan M. Jordan Author 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 41

Link Farms: networks of millions of pages design to focus PageRank on a few undeserving webpages To minimize their influence use a teleport set of trusted webpages E.g., homepages of universities 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 42

11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 44

Rich get richer [Cho et al., WWW 04] Two snapshots of the web graph at two different time points Measure the change: In the number of in links PageRank http://oak.cs.ucla.edu/~cho/papers/cho-bias.pdf 11/6/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 45