Information Retrieval. Lecture 11 - Link analysis

Information Retrieval
Lecture 11 - Link analysis
Seminar für Sprachwissenschaft, International Studies in Computational Linguistics
Winter semester 2007

Introduction
- Link analysis: using hyperlinks to rank web search results
- Link analysis is only one of the factors used by search engines to compute the score of a page for a given query
- Note that counting in-links is not enough (cf. spam links)
- Link analysis is comparable to citation analysis (authority of a paper ≈ number of citations it receives)

Overview
- Recall: web as a graph
- PageRank
- Topic-specific PageRank
- Hubs and authorities

Recall: web as a graph
[Figure: Page A contains the hyperlinks <a href="URL B">text 1</a> and <a href="URL C">text 2</a>, which point to Page B and Page C]

Recall: web as a graph (continued)
Two observations:
- (a) the anchor text pointing to a page B is a good description of page B
- (b) a hyperlink from page A to page B is an endorsement of page B
Consequences:
- (a) helps to index pages that do not contain the terms people usually use to refer to them, and also to index images and other specific content
- (a) can be used jointly with the analysis of a window of terms surrounding the anchor
- (b) suggests that observing the distribution of links within the web can help to find the most relevant pages

Overview
- Recall: web as a graph
- PageRank
  - About PageRank
  - Markov chains
  - Markov chains and the web
  - Computing the PageRank score
- Topic-specific PageRank
- Hubs and authorities

About PageRank
- PageRank: a scoring measure based only on the link structure of web pages
- Every node in the web graph is given a score between 0 and 1, depending on its in-links and out-links
- Given a query, a search engine combines the PageRank score with other values (e.g. cosine similarity, relevance feedback, etc.) to compute the ranked retrieval

About PageRank (continued)
- Underlying idea: compute a score reflecting the visitability of a page
- When surfing the web, a user may visit some pages more often than others (e.g. pages with more in-links)
- Pages that are visited often are more likely to contain relevant information
- When a dead-end page is reached, the user may teleport (e.g. type an address in the browser)
- NB: the teleport operation is a uniform random choice among the nodes of the web graph

Assigning a PageRank score
Based on a traversal of the web graph:
1. When a page has no out-links, the user teleports
2. When a page has out-links, the user may teleport with a probability $\alpha$ ($0 < \alpha < 1$)
- In the second case, the probability that the user clicks on an out-link is $1 - \alpha$ ($\alpha$ is generally set to 0.1)
- When the surfer follows this scheme for long enough, he visits each node $v$ a fixed fraction of the time $\pi(v)$, which depends on both the structure of the graph and $\alpha$
- $\pi(v)$ is the PageRank of $v$
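The traversal rules above can be illustrated by simulating a surfer and counting visits. This is only a sketch: the three-page graph, the function name `simulate_surfer` and the choice $\alpha = 0.5$ are assumptions made for illustration, and the surfer is assumed to pick uniformly among the out-links of a page when he does not teleport.

```python
import random

def simulate_surfer(links, alpha, steps, seed=0):
    """Estimate pi(v) as the fraction of steps the surfer spends on page v.

    links maps each page to the list of pages it links to (hypothetical data).
    """
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = pages[0]
    for _ in range(steps):
        out_links = links[current]
        # Rule 1: no out-links, always teleport.
        # Rule 2: otherwise teleport with probability alpha.
        if not out_links or rng.random() < alpha:
            current = rng.choice(pages)      # uniform teleport over all pages
        else:
            current = rng.choice(out_links)  # follow a uniformly chosen out-link
        visits[current] += 1
    return {page: visits[page] / steps for page in pages}

# Hypothetical graph: A -> B, B -> A and C, C -> B
links = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
visit_share = simulate_surfer(links, alpha=0.5, steps=200_000)
for page, share in visit_share.items():
    print(page, round(share, 3))
```

For this graph and $\alpha = 0.5$ the visit fractions should settle close to 5/18, 4/9 and 5/18, the steady state given for the same graph in the power iteration example later in the lecture.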

Markov chains
- Discrete stochastic process, corresponding to a sequence of steps at each of which a random choice is made
- Can be characterized by an $N \times N$ transition probability matrix $P$, where:
  $$0 \le P_{ij} \le 1 \quad (1 \le i, j \le N)$$
  $$\sum_{j=1}^{N} P_{ij} = 1 \quad \forall i \in [1..N]$$
- $P_{ij}$ gives the probability, being in state $i$ at time $t$, of being in state $j$ at time $t + 1$ (called the transition probability)
- Note that there is no memory (i.e. the probability only depends on the current state, not on the previous ones)

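Markov chains (continued)
- Property of such stochastic matrices: they have a principal left eigenvector, associated with the largest eigenvalue, which is 1
- Recall: a vector $v$ is a (right) eigenvector of a matrix $M$ iff $M v = \lambda v$, where $\lambda$ is the eigenvalue associated with $v$; a left eigenvector is a row vector $v$ such that $v M = \lambda v$
- A diagonalizable matrix can be decomposed as $M = Q \Lambda Q^{-1}$, where:
  - $Q$ is such that its $i$-th column $Q_i$ is the $i$-th eigenvector
  - $\Lambda$ is a diagonal matrix, and $\Lambda_{ii}$ is the $i$-th eigenvalue
  - with the eigenvalues sorted in decreasing order, $Q_1$ is the principal eigenvector and $\Lambda_{11}$ the corresponding eigenvalue (the left eigenvectors of $M$ are the eigenvectors of $M^T$)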

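Markov chains: example
Transition probability matrix over the three states A, B and C (row = current state, column = next state):
$$P = \begin{pmatrix} 0 & 0.5 & 0.5 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}$$
Example from (Manning et al., 2008)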

Probability vectors
- A probability vector $v$ is such that:
  $$v_i \in [0, 1] \quad (1 \le i \le N)$$
  $$\sum_{i=1}^{N} v_i = 1$$
- A probability vector $v = (v_1 \dots v_N)$ defines a state in the chain
- $v = (0 \dots 1 \dots 0)$, with the 1 in position $i$, refers to state $i$
- More generally, the component $v_i$ of a probability vector gives the probability of being in state $i$
- If $v$ is the probability vector of the current step, the probability vector of the next step is $v P$
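As a small illustration of the update $v \mapsto v P$, the sketch below applies two transitions to the three-state example matrix (states A, B, C), starting from state A. The helper names `is_stochastic` and `step` are assumptions introduced here; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def is_stochastic(P):
    """True if every entry lies in [0, 1] and every row sums to 1."""
    return all(
        all(0 <= p <= 1 for p in row) and sum(row) == 1
        for row in P
    )

def step(v, P):
    """One transition: the row vector v multiplied by the matrix P."""
    n = len(P)
    return [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]

half = Fraction(1, 2)
P = [[0, half, half],   # from A: to B or C with probability 1/2 each
     [1, 0, 0],         # from B: always back to A
     [1, 0, 0]]         # from C: always back to A

v0 = [1, 0, 0]          # the chain starts in state A
v1 = step(v0, P)
v2 = step(v1, P)
print(is_stochastic(P), v1, v2)
```

After one step the surfer is in B or C with probability 1/2 each, and after two steps he is back in A with certainty, so this particular chain alternates forever between A and {B, C}.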

Ergodic Markov chains
- A Markov chain is said to be ergodic when there exists $T_0 \in \mathbb{R}^+$ such that, for all states $i, j \in [1..N]$ and for all $t > T_0$, the probability of being in state $j$ at time $t$ (having started in state $i$) is $> 0$
- To be ergodic, a Markov chain needs to have 2 properties:
  - irreducibility: for all states $i, j$ there is a sequence of transitions from $i$ to $j$ with non-zero probability
  - aperiodicity: the states cannot be partitioned into sets such that every transition moves cyclically from one set to the next
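For a small chain, ergodicity can be checked directly from the definition by looking for a power of $P$ whose entries are all positive. The sketch below is one possible way to do it, not a method taken from the lecture: the name `is_ergodic` is hypothetical, and the default number of powers relies on Wielandt's bound (an $N \times N$ matrix with an all-positive power already has one at exponent $(N-1)^2 + 1$), which is an outside fact assumed here.

```python
def is_ergodic(P, max_power=None):
    """Check whether some power P^t (t <= max_power) has only positive entries.

    Once one power is all positive, all later powers are too, because every
    row of a stochastic matrix has at least one positive entry.
    """
    n = len(P)
    if max_power is None:
        max_power = (n - 1) ** 2 + 1  # Wielandt's bound (assumption, see lead-in)
    # Only the pattern of non-zero entries matters, so work with booleans.
    one_step = [[p > 0 for p in row] for row in P]
    reach = one_step  # pattern of P^1
    for _ in range(max_power):
        if all(all(row) for row in reach):
            return True
        # Pattern of the next power: reach . one_step
        reach = [
            [any(reach[i][k] and one_step[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return False

# Three-state example chain (A, B, C): it alternates between A and {B, C}.
periodic = [[0, 0.5, 0.5], [1, 0, 0], [1, 0, 0]]
# A chain in which every transition has non-zero probability.
positive = [[1/6, 2/3, 1/6], [5/12, 1/6, 5/12], [1/6, 2/3, 1/6]]
print(is_ergodic(periodic), is_ergodic(positive))
```

The first chain is irreducible but periodic, so it fails the test; adding a teleport probability makes every entry positive, which is why the web chain built with teleportation is ergodic.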

Property of an ergodic Markov chain
- For any ergodic Markov chain, there exists a unique probability vector $\pi$ which is the principal left eigenvector of the probability matrix $P$, and such that, if we write $\eta(i, t)$ for the number of visits to state $i$ in $t$ steps:
  $$\lim_{t \to +\infty} \frac{\eta(i, t)}{t} = \pi_i > 0$$
- $\pi_i$ is the steady-state probability for state $i$

Markov chains and the web
- A random web surf can be seen as a Markov chain
- Consider the adjacency matrix $A$ such that, $\forall i, j \in [1..N]$:
  $$A_{ij} = \begin{cases} 1 & \text{iff there is a link from page } i \text{ to page } j \\ 0 & \text{otherwise} \end{cases}$$
- The $N \times N$ probability matrix $P$ of a web surf is built using the following algorithm:
  1) each 1 in $A$ is divided by the number of 1s in its row
  2) the resulting matrix is multiplied by $1 - \alpha$
  3) $\alpha / N$ is added to every entry of the resulting matrix (teleport probability)
- The resulting matrix is $P$
- Note: a row of $A$ without any 1 (a page with no out-links) is instead replaced by $1/N$ in every entry, since the surfer always teleports from such a page

Markov chains and the web: example
- Consider the following adjacency matrix:
  $$A = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$$
- Compute the probability matrix for a teleport probability $\alpha = 0.5$
Example from (Manning et al., 2008)
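The construction of $P$ from $A$ can be sketched as follows and applied to the example. The function name `transition_matrix` is an assumption; rows without any out-link are given the uniform value $1/N$, in line with the rule that the surfer always teleports from a page with no out-links.

```python
from fractions import Fraction

def transition_matrix(A, alpha):
    """Build the probability matrix P of a web surf from the adjacency matrix A."""
    n = len(A)
    alpha = Fraction(alpha)
    P = []
    for row in A:
        out_degree = sum(row)
        if out_degree == 0:
            # Page without out-links: the surfer teleports uniformly.
            P.append([Fraction(1, n)] * n)
        else:
            # Steps 1-3: normalise the row, scale by (1 - alpha), add alpha / N.
            P.append([(1 - alpha) * Fraction(a, out_degree) + alpha / n for a in row])
    return P

A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
P = transition_matrix(A, Fraction(1, 2))
for row in P:
    print("  ".join(str(p) for p in row))
```

For the example this gives the rows (1/6, 2/3, 1/6), (5/12, 1/6, 5/12) and (1/6, 2/3, 1/6), each of which sums to 1.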

Markov chains and the web (continued)
- In a long traversal of the web (i.e. of the Markov chain), each state is visited at a different frequency
- If we consider the Markov chain representing the web to be ergodic, this frequency of visits converges to a fixed steady-state quantity: the PageRank score

Computing the PageRank score
- One way to compute the PageRank score is to use the power iteration method, based on the following remark: if $\pi$ is a steady-state distribution and $P$ the probability matrix, we have:
  $$\pi P = \lambda \pi$$
- In other terms, $\pi$ is the left eigenvector of $P$ whose eigenvalue is 1:
  $$\pi P = 1 \cdot \pi$$

Computing the PageRank score (continued)
- Wherever we start, after some iterations, we reach the steady state $\pi$
- If the initial probability vector (initial step) is $x$, after one step we are in $x P$, after two steps we are in $x P^2$, and so on
- For a large $k$, $x P^k \approx \pi$

Computing the PageRank score: example
- Consider the following probability matrix:
  $$P = \begin{pmatrix} 1/6 & 2/3 & 1/6 \\ 5/12 & 1/6 & 5/12 \\ 1/6 & 2/3 & 1/6 \end{pmatrix}$$
- If the surf begins at state 1, compute the first 3 transitions
- Note that the steady-state vector is reached after several iterations and is:
  $$\pi = (5/18 \quad 4/9 \quad 5/18)$$
Example from (Manning et al., 2008)
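The example can be worked through with a short power iteration sketch. The helper names `step` and `power_iteration`, the tolerance and the iteration cap are assumptions chosen for illustration; the first transitions are computed with exact fractions and the convergence loop with floats.

```python
from fractions import Fraction as F

P = [[F(1, 6), F(2, 3), F(1, 6)],
     [F(5, 12), F(1, 6), F(5, 12)],
     [F(1, 6), F(2, 3), F(1, 6)]]

def step(x, P):
    """One transition: the row vector x multiplied by P."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

def power_iteration(x, P, tol=1e-12, max_iter=1000):
    """Repeat x <- x P (in floating point) until x stops changing."""
    x = [float(c) for c in x]
    P = [[float(p) for p in row] for row in P]
    for _ in range(max_iter):
        nxt = step(x, P)
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x

# First three transitions, starting in state 1.
x = [F(1), F(0), F(0)]
trajectory = []
for _ in range(3):
    x = step(x, P)
    trajectory.append(x)
    print([str(c) for c in x])

pi = [F(5, 18), F(4, 9), F(5, 18)]
print(step(pi, P) == pi)                 # pi is left unchanged by P
print(power_iteration([1, 0, 0], P))     # numerical approximation of pi
```

The three transitions are (1/6, 2/3, 1/6), (1/3, 1/3, 1/3) and (1/4, 1/2, 1/4), and multiplying $\pi$ by $P$ returns $\pi$ itself, which confirms that it is the left eigenvector with eigenvalue 1.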

Remarks about the PageRank score
- The PageRank score is independent of any query: it is a static quality measure of a web page
- Thus the PageRank score is precomputed by (i) building the probability matrix for a graph of web pages, and (ii) computing its steady state (iteratively or not)
- At run-time, the query is processed to retrieve a given number of pages, which are then ranked using the query-independent PageRank score
- Note that the Markov model presented here does not take the back button into account

Topic-specific PageRank
- Relies on the fact that the teleport target cannot be considered to be chosen uniformly at random
- Some pages are more likely to be entered in the browser than others
- Idea: the teleport target is chosen uniformly at random among the pages of a given topic
- These topics are defined via either:
  - a manually built directory of pages (e.g. Open Directory Project, http://www.dmoz.org)
  - an automatic text classification algorithm

Topic-specific PageRank (continued)
- Use a personalized PageRank score
- Idea: approximate the interests of the search user as a linear combination of a small number of topics
- Each elementary PageRank score (one per topic) is precomputed considering a teleport operation within this given topic
- Then, the personalized PageRank for a given user is computed on the fly as a linear combination of the elementary PageRank scores
- Example: user interested mainly in sports (60%) and health (40%):
  $$\pi = 0.6\, \pi_{sport} + 0.4\, \pi_{health}$$
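The linear combination can be checked on a toy graph. Everything in the sketch below is hypothetical: the four-page graph, the assignment of pages to the topics, and the helper names `topic_pagerank` and `uniform_over`. It assumes that every page has at least one out-link and that the surfer teleports with probability $\alpha$ at every step, landing uniformly on a page of the topic.

```python
def topic_pagerank(M, teleport, alpha=0.1, iterations=300):
    """Power iteration for x = alpha * teleport + (1 - alpha) * x M.

    M is the row-normalised link matrix, teleport a probability vector
    concentrated on the pages of one topic.
    """
    n = len(M)
    x = [1.0 / n] * n
    for _ in range(iterations):
        x = [
            alpha * teleport[j] + (1 - alpha) * sum(x[i] * M[i][j] for i in range(n))
            for j in range(n)
        ]
    return x

def uniform_over(pages, n):
    """Teleport vector that is uniform on the given set of pages."""
    return [1.0 / len(pages) if j in pages else 0.0 for j in range(n)]

# Hypothetical 4-page graph; every page has at least one out-link.
links = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}
n = len(links)
M = [[1.0 / len(links[i]) if j in links[i] else 0.0 for j in range(n)] for i in range(n)]

t_sport = uniform_over({0, 1}, n)   # assumed sports pages
t_health = uniform_over({3}, n)     # assumed health page

pi_sport = topic_pagerank(M, t_sport)
pi_health = topic_pagerank(M, t_health)

# Personalized score computed on the fly from the precomputed topic scores.
pi_user = [0.6 * s + 0.4 * h for s, h in zip(pi_sport, pi_health)]

# Same score computed directly with the mixed teleport vector.
t_user = [0.6 * s + 0.4 * h for s, h in zip(t_sport, t_health)]
pi_direct = topic_pagerank(M, t_user)
print(pi_user)
print(pi_direct)
```

Both vectors agree up to rounding, which is what makes it possible to precompute one score per topic and combine them per user at query time.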

Hubs and authorities
- Underlying idea: there are 2 main kinds of useful pages for broad-topic searches:
  - authoritative sources of information, or authorities (e.g. a medical research institute)
  - hand-compiled lists of authoritative sources, or hubs (e.g. an association promoting health care)
- Basic property of such pages:
  - good hubs point to many good authorities
  - good authorities are pointed to by many good hubs
- In this context, given a query, web pages are given 2 scores: a hub score and an authority score

Hubs and authorities (continued)
Iterative computation of these scores:
- Starting point: a subset $S$ of good hubs and authorities
- All nodes $v$ are given the scores $h(v) = a(v) = 1$
- Iteration (where $v \to y$ means that page $v$ links to page $y$):
  $$h(v) \leftarrow \sum_{v \to y} a(y)$$
  $$a(v) \leftarrow \sum_{y \to v} h(y)$$
- Using the adjacency matrix $A$, this can be expressed as:
  $$h \leftarrow A\, a \qquad a \leftarrow A^T h$$
- Thus:
  $$h \leftarrow A A^T h \qquad a \leftarrow A^T A\, a$$
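The iteration can be sketched with plain sums over the links. The function names `hits` and `normalize`, the toy edge list and the rescaling to a sum of 1 after each round are assumptions made for illustration; in each round the authority scores are updated first and the hub scores are then computed from the new authority scores.

```python
def normalize(scores):
    """Scale the scores so that they sum to 1 (only relative values matter)."""
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else scores

def hits(edges, n, iterations=5):
    """Hub and authority scores for n pages linked by (source, target) edges."""
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        # a(v) <- sum of h(y) over all pages y that link to v
        new_auth = [0.0] * n
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # h(v) <- sum of a(y) over all pages y that v links to
        new_hub = [0.0] * n
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth

# Hypothetical graph: pages 0 and 1 act as hubs, pages 2 and 3 as authorities.
edges = [(0, 2), (1, 2), (0, 3)]
hub, auth = hits(edges, n=4)
print("hubs:", [round(h, 3) for h in hub])
print("authorities:", [round(a, 3) for a in auth])
```

In this graph page 2 is pointed to by both hubs and gets the highest authority score, while page 0 points to both authorities and gets the highest hub score.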

Hubs and authorities (continued)
To sum up:
1) gather a subset of web pages
2) compute the adjacency matrix $A$, $A A^T$ and $A^T A$
3) compute the principal eigenvectors of $A A^T$ and $A^T A$, which are respectively $h$ and $a$
How to select the starting subset?
a) given a query, use a text index to get all pages containing the terms: the root set
b) add all pages that either point to or are pointed to by pages of the root set: the base set
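The selection of the starting subset can be sketched as a single pass over the links. The page names, the link table and the function name `base_set` are hypothetical; the root set is assumed to have been returned by a text index for the query.

```python
def base_set(root, links):
    """Extend the root set with every page that links to, or is linked
    from, a page of the root set.

    links maps each page to the list of pages it points to.
    """
    root = set(root)
    base = set(root)
    for src, targets in links.items():
        for dst in targets:
            if src in root or dst in root:
                base.add(src)
                base.add(dst)
    return base

# Hypothetical link structure
links = {
    "a": ["b"],   # a points to the root page b, so a is added
    "b": ["c"],   # the root page b points to c, so c is added
    "c": ["d"],   # neither c nor d is in the root set
    "e": ["a"],   # e is two links away from the root set
}
root = {"b"}
print(sorted(base_set(root, links)))
```

Only pages one link away from the root set are added, so `d` and `e` stay out of the base set.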

Hubs and authorities (continued)
- Method known as Hyperlink-Induced Topic Search (HITS)
- The top hubs and authorities may include pages in languages other than that of the query (cross-language retrieval)
- 200 pages are enough for the root set
- 5 iterations are usually enough to compute the top hubs and authorities
- We are more interested in relative scores than in absolute ones, thus $a$ and $h$ can be scaled down during the iterations
- In practice, additive updates are used rather than matrix products (time complexity)
- Main issues:
  - off-topic authorities (e.g. super-topic)
  - bias via affiliated web pages

Conclusion
- Link analysis is used to guide the crawling of the web and to give a static score to a web page
- This static score is one of the components of the final score a page gets for a given query
- The PageRank algorithm relies on Markov chains to compute a score corresponding to a steady state, in terms of the frequency of visits of a page during a web surf
- The HITS algorithm is used for broad-topic searches and computes scores from the idea that reliable hubs connect reliable authorities

References
- C. Manning, P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. http://nlp.stanford.edu/ir-book/pdf/chapter21-linkanalysis.pdf
- Lawrence Page and Sergey Brin (1998). The PageRank Citation Ranking: Bringing Order to the Web. http://citeseer.ist.psu.edu/page98pagerank.html
- Jon Kleinberg (1999). Authoritative Sources in a Hyperlinked Environment. http://citeseer.ist.psu.edu/kleinberg99authoritative.html