Link Analysis. Hongning Wang

Similar documents
Information Retrieval. Lecture 11 - Link analysis

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

PageRank and related algorithms

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Lecture 8: Linkage algorithms and web search

Link Analysis and Web Search

COMP5331: Knowledge Discovery and Data Mining

CS6200 Information Retreival. The WebGraph. July 13, 2015

COMP 4601 Hubs and Authorities

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

Link Structure Analysis

COMP Page Rank

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Information Networks: PageRank

Introduction to Information Retrieval

The application of Randomized HITS algorithm in the fund trading network

Lecture 17 November 7

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

Information Retrieval and Web Search

Mining Web Data. Lijun Zhang

TODAY S LECTURE HYPERTEXT AND

How to organize the Web?

Link analysis in web IR CE-324: Modern Information Retrieval Sharif University of Technology

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

Lec 8: Adaptive Information Retrieval 2

Link Analysis SEEM5680. Taken from Introduction to Information Retrieval by C. Manning, P. Raghavan, and H. Schutze, Cambridge University Press.

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

Web Structure Mining using Link Analysis Algorithms

Adaptive methods for the computation of PageRank

.. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

Bruno Martins. 1 st Semester 2012/2013

Slides based on those in:

Brief (non-technical) history

Big Data Analytics CSCI 4030

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy

Searching the Web [Arasu 01]

Lecture 8: Linkage algorithms and web search

Part 1: Link Analysis & Page Rank

MAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Information Retrieval and Web Search Engines

Big Data Analytics CSCI 4030

Recent Researches on Web Page Ranking

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

CS 6604: Data Mining Large Networks and Time-Series

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

On Finding Power Method in Spreading Activation Search

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Extractive Text Summarization Techniques

PV211: Introduction to Information Retrieval

CS425: Algorithms for Web Scale Data

Motivation. Motivation

CS60092: Informa0on Retrieval

Mining Web Data. Lijun Zhang

SimRank : A Measure of Structural-Context Similarity

2013/2/12 EVOLVING GRAPH. Bahman Bahmani(Stanford) Ravi Kumar(Google) Mohammad Mahdian(Google) Eli Upfal(Brown) Yanzhao Yang

Introduction to Data Mining

Multimedia Content Management: Link Analysis. Ralf Moeller Hamburg Univ. of Technology

Introduction to Data Mining

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page

Path Analysis References: Ch.10, Data Mining Techniques By M.Berry, andg.linoff Dr Ahmed Rafea

Social Network Analysis

Network Centrality. Saptarshi Ghosh Department of CSE, IIT Kharagpur Social Computing course, CS60017

Collaborative filtering based on a random walk model on a graph

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Generalized Social Networks. Social Networks and Ranking. How use links to improve information search? Hypertext

The PageRank Citation Ranking: Bringing Order to the Web

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google

Finding Top UI/UX Design Talent on Adobe Behance

Link Analysis in Web Mining

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Advanced Computer Architecture: A Google Search Engine

Information Retrieval. Lecture 4: Search engines and linkage algorithms

Analysis of Large Graphs: TrustRank and WebSpam

Information Retrieval

Unit VIII. Chapter 9. Link Analysis

A Reordering for the PageRank problem

CS6322: Information Retrieval Sanda Harabagiu. Lecture 10: Link analysis

How Google Finds Your Needle in the Web's

Link analysis in web IR CE-324: Modern Information Retrieval Sharif University of Technology

Learning to Rank Networked Entities

Link analysis. Query-independent ordering. Query processing. Spamming simple popularity

A brief history of Google

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

An Application of Personalized PageRank Vectors: Personalized Search Engine

5/30/2014. Acknowledgement. In this segment: Search Engine Architecture. Collecting Text. System Architecture. Web Information Retrieval

Authoritative Sources in a Hyperlinked Environment

CS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science

Page Rank Algorithm. May 12, Abstract

Transcription:

Link Analysis Hongning Wang CS@UVa

Structured v.s. unstructured data Our claim before IR v.s. DB = unstructured data v.s. structured data As a result, we have assumed Document = a sequence of words Query = a short document Corpus = a set of documents However, this assumption is not accurate CS@UVa CS 4501: Information Retrieval 2

A typical web document has Title Anchor Body CS@UVa CS 4501: Information Retrieval 3

How does a human perceive a document s structure CS@UVa CS 4501: Information Retrieval 4

Intra-document structures Document Title Paragraph 1 Concise summary of the document Likely to be an abstract of the document Paragraph 2.. Images Anchor texts They might contribute differently for a document s relevance! Visual description of the document References to other documents CS@UVa CS 4501: Information Retrieval 5

Exploring intra-document structures for retrieval Document Title Paragraph 1 Paragraph 2.. Anchor texts Intuitively, we want to give different weights to the parts to reflect their importance In vector space model? Select D j and generate a query word using D j pq ( DR, ) = pw ( DR, ) = n i= 1 n i= 1 k j= 1 i Weighted TF Think about query-likelihood model sd ( DRpw, ) ( D, R) j i j part selection prob. Serves as weight for D j Can be estimated by EM or manually set CS@UVa CS 4501: Information Retrieval 6

Inter-document structure Documents are no longer independent Source: https://wiki.digitalmethods.net/dmi/wikipediaanalysis CS@UVa CS 4501: Information Retrieval 7

What do the links tell us? Anchor Rendered form Original form CS@UVa CS 4501: Information Retrieval 8

What do the links tell us? Anchor text How others describe the page E.g., big blue is a nick name of IBM, but never found on IBM s official web site A good source for query expansion, or can be directly put into index CS@UVa CS 4501: Information Retrieval 9

What do the links tell us? Linkage relation Endorsement from others utility of the page "PageRank-hi-res". Licensed under Creative Commons Attribution-Share Alike 2.5 via Wikimedia Commons - http://commons.wikimedia.org/wiki/file:pagerank-hi-res.png#mediaviewer/file:pagerank-hi-res.png CS@UVa CS 4501: Information Retrieval 10

Analogy to citation network Authors cite others work because A conferral of authority They appreciate the intellectual value in that paper There is certain relationship between the papers Bibliometrics A citation is a vote for the usefulness of that paper Citation count indicates the quality of the paper E.g., # of in-links CS@UVa CS 4501: Information Retrieval 11

Situation becomes more complicated in the web environment Adding a hyperlink costs almost nothing Taken advantage by web spammers Large volume of machine-generated pages to artificially increase in-links of the target page Fake or invisible links We should not only consider the count of inlinks, but the quality of each in-link PageRank HITS CS@UVa CS 4501: Information Retrieval 12

Link structure analysis Describes the characteristic of network structure Reflect the utility of web documents in a general sense An important factor when ranking documents For learning-to-rank For focused crawling CS@UVa CS 4501: Information Retrieval 13

Recall how we do internet browsing 1. Mike types a URL address in his Chrome s URL bar; 2. He browses the content of the page, and follows the link he is interested in; 3. When he feels the current page is not interesting or there is no link to follow, he types another URL and starts browsing from there; 4. He repeats 2 and 3 until he is tired or satisfied with this browsing activity CS@UVa CS 4501: Information Retrieval 14

PageRank A random surfing model of internet 1. A surfer begins at a random page on the web and starts random walk on the graph 2. On current page, the surfer uniformly follows an out-link to the next page 3. When there is no out-link, the surfer uniformly jumps to a page from the whole page 4. Keep doing Step 2 and 3 forever CS@UVa CS 4501: Information Retrieval 15

PageRank A measure of web page popularity Probability of a random surfer who arrives at this web page Only depends on the linkage structure of web pages d 4 d 1 d 3 MM = d 2 Transition matrix 0 0 0 0 0 1 0.5 0.5 0.5 0.5 0 0 0 0 0 0 pp tt dd = ααmm TT pp tt 1 dd + 1 αα NN Random walk α: probability of random jump N: # of pages pp tt 1 dd CS@UVa CS 4501: Information Retrieval 16

Theoretic model of PageRank Markov chains A discrete-time stochastic process It occurs in a series of time-steps in each of which a random choice is made Can be described by a directed graph or a transition matrix P(So-so Cheerful)=0.2 MM = 0.6 0.2 0.2 0.3 0.4 0.3 0 0.3 0.7 CS@UVa CS 4501: Information Retrieval 17 A first-order Markov chain for emotion

Markov chains Markov property PP XX nn+1 XX 1,, XX nn = PP(XX nn+1 XX nn ) Memoryless (first-order) Transition matrix A stochastic matrix ii, jj MM iiii = 1 Key property MM = Idea of random surfing 0.6 0.2 0.2 0.3 0.4 0.3 0 0.3 0.7 It has a principal left eigenvector corresponding to its largest eigenvalue, which is one Mathematical interpretation of PageRank score CS@UVa CS 4501: Information Retrieval 18

Theoretic model of PageRank Transition matrix of a Markov chain for PageRank d 1 d 3 AA = d 2 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1. Enable random jump on dead end AAA = 0 0 0.25 0.25 0 1 1 1 1 1 0.25 0.25 0 0 0 0 MM = d 4 0.125 0.125 0.25 0.25 0.125 0.625 0.375 0.375 0.375 0.375 0.25 0.25 0.125 0.125 0.125 0.125 3. Enable random jump on all nodes αα = 0.5 AAAA = 2. Normalization 0 0 0.25 0.25 0 1 0.5 0.5 0.5 0.5 0.25 0 0.25 0 0 0 CS@UVa CS 4501: Information Retrieval 19

Steps to derive transition matrix for PageRank 1. If a row of A has no 1 s, replace each element by 1/N. 2. Divide each 1 in A by the number of 1 s in its row. 3. Multiply the resulting matrix by 1 α. 4. Add α/n to every entry of the resulting matrix, to obtain M. A: adjacent matrix of network structure; α: dumping factor CS@UVa CS 4501: Information Retrieval 20

PageRank computation becomes pp tt dd = MM TT pp tt 1 dd Assuming pp 0 dd = 1 NN,, 1 NN Iterative computation (forever?) pp tt dd = MM TT pp tt 1 dd = = (MM TT ) tt pp 0 dd Intuition: after enough rounds of random walk, each dimension of pp tt dd indicates the frequency of a random surfer visiting document dd Question: will this frequency converges to certain fixed, steady-state quantity? CS@UVa CS 4501: Information Retrieval 21

Stationary distribution of a Markov chain For a given Markov chain with transition matrix M, its stationary distribution of π is ii SS, ππ ii 0 ii SS ππ ii = 1 A probability vector Random walk does not affect its distribution ππ = MM TT ππ Necessary condition Irreducible: a state is reachable from any other state Aperiodic: states cannot be partitioned such that transitions happened periodically among the partitions CS@UVa CS 4501: Information Retrieval 22

Markov chain for PageRank Random jump operation makes PageRank satisfy the necessary conditions 1. Random jump makes every node is reachable for the other nodes 2. Random jump breaks potential loop in a subnetwork What does PageRank score really converge to? CS@UVa CS 4501: Information Retrieval 23

Stationary distribution of PageRank For any irreducible and aperiodic Markov chain, there is a unique steady-state probability vector π, such that if cc(ii, tt) is the number of visits to state ii after tt steps, then lim tt cc ii, tt tt = ππ ii PageRank score converges to the expected visit frequency of each node CS@UVa CS 4501: Information Retrieval 24

Computation of PageRank Power iteration pp tt dd = MM TT pp tt 1 dd = = (MM TT ) tt pp 0 (dd) Normalize pp tt dd in each iteration Convergence rate is determined by the second eigenvalue Random walk becomes series of matrix production Alternative interpretation of PageRank score Principal left eigenvector corresponding to its largest eigenvalue, which is one MM TT ππ = 1 ππ CS@UVa CS 4501: Information Retrieval 25

Computation of PageRank An example from Manning s text book MM = 1/6 2/3 1/6 5/12 1/6 5/12 1/6 2/3 1/6 CS@UVa CS 4501: Information Retrieval 26

Variants of PageRank Topic-specific PageRank Control the random jump to topic-specific nodes E.g., surfer interests in Sports will only randomly jump to Sports-related website when they have no out-links to follow pp tt dd = [ααmm TT + 1 αα eepp TT (dd)]pp tt 1 dd pp TT dd > 0 iff d belongs to the topic of interest ee is a column vector of ones CS@UVa CS 4501: Information Retrieval 27

Variants of PageRank Topic-specific PageRank A user s interest is a mixture of topics User s interest: 60% Sports, 40% politics Damping factor: 10% Compute it off-line ππ = pp(tt kk uuuuuuuu) ππ TTkk Manning, Introduction to Information Retrieval, Chapter 21, Figure 21.5 CS@UVa CS 4501: Information Retrieval 28 kk

Variants of PageRank LexRank A sentence is important if it is similar to other important sentences PageRank on sentence similarity graph Centrality-based sentence salience ranking for document summarization CS@UVa CS 4501: Information Retrieval 29 Erkan & Radev, JAIR 04

Variants of PageRank SimRank Two objects are similar if they are referenced by similar objects PageRank on bipartite graph of object relations Measure similarity between objects via their connecting relation Glen & Widom, KDD'02 CS@UVa CS 4501: Information Retrieval 30

HITS algorithm Two types of web pages for a broad-topic query Authorities trustful source of information UVa-> University of Virginia official site Hubs hand-crafted list of links to authority pages for a specific topic Deep learning -> deep learning reading list CS@UVa CS 4501: Information Retrieval 31

HITS algorithm Intuition HITS=Hyperlink-Induced Topic Search Using hub pages to discover authority pages Assumption A good hub page is one that points to many good authorities -> a hub score A good authority page is one that is pointed to by many good hub pages -> an authority score Recursive definition indicates iterative algorithm CS@UVa CS 4501: Information Retrieval 32

Computation of HITS scores Two scores for a web page for a given query Authority score: aa dd Hub score: h(dd) aa dd vv dd h(vv) vv dd means there is a link from vv to dd h dd aa(vv) Important HITS scores are query-dependent! dd vv With proper normalization (L 2 -norm) CS@UVa CS 4501: Information Retrieval 33

Computation of HITS scores In matrix form aa AA TT h and h AA aa That is aa AA TT AA aa and h AAAA TT h Another eigen-system aa = 1 λλ aa AA TT AA aa Power iteration is applicable here as well h = 1 λλ h AAAA TT h CS@UVa CS 4501: Information Retrieval 34

Constructing the adjacent matrix Only consider a subset of the Web 1. For a given query, retrieve all the documents containing the query (or top K documents in a ranked list) root set 2. Expand the root set by adding pages either linking to a page in the root set, or being linked to by a page in the root set base set 3. Build adjacent matrix of pages in the base set CS@UVa CS 4501: Information Retrieval 35

Constructing the adjacent matrix Reasons behind the construction steps Reduce the computation cost A good authority page may not contain the query text The expansion of root set might introduce good hubs and authorities into the sub-network CS@UVa CS 4501: Information Retrieval 36

Sample results Kleinberg, JACM'99 Manning, Introduction to Information Retrieval, Chapter 21, Figure 21.6 CS@UVa CS 4501: Information Retrieval 37

Today s reading Introduction to information retrieval Chapter 21: Link Analysis CS@UVa CS 4501: Information Retrieval 38

References Page, Lawrence, Sergey Brin, Rajeev Motwani, and Terry Winograd. "The PageRank citation ranking: Bringing order to the web." (1999) Haveliwala, Taher H. "Topic-sensitive pagerank." In Proceedings of the 11th international conference on World Wide Web, pp. 517-526. ACM, 2002. Erkan, Günes, and Dragomir R. Radev. "LexRank: Graph-based lexical centrality as salience in text summarization." J. Artif. Intell. Res.(JAIR) 22, no. 1 (2004): 457-479. Jeh, Glen, and Jennifer Widom. "SimRank: a measure of structuralcontext similarity." In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 538-543. ACM, 2002. Kleinberg, Jon M. "Authoritative sources in a hyperlinked environment." Journal of the ACM (JACM) 46, no. 5 (1999): 604-632. CS@UVa CS 4501: Information Retrieval 39