Authoritative Sources in a Hyperlinked Environment

Similar documents
COMP5331: Knowledge Discovery and Data Mining

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

Authoritative Sources in a Hyperlinked Environment

Lecture 17 November 7

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Bibliometrics: Citation Analysis

Link Analysis and Web Search

PageRank and related algorithms

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Path Analysis References: Ch.10, Data Mining Techniques By M.Berry, andg.linoff Dr Ahmed Rafea

Social Network Analysis

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Information Retrieval. Lecture 4: Search engines and linkage algorithms

CS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Generalized Social Networks. Social Networks and Ranking. How use links to improve information search? Hypertext

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

CSI 445/660 Part 10 (Link Analysis and Web Search)

CS6200 Information Retreival. The WebGraph. July 13, 2015

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

Information Networks: Hubs and Authorities

Recent Researches on Web Page Ranking

Information Retrieval and Web Search

Abstract. 1. Introduction

Information Retrieval. Lecture 11 - Link analysis

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

CS 6604: Data Mining Large Networks and Time-Series

Analysis and Improvement of HITS Algorithm for Detecting Web Communities

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Information Retrieval and Web Search Engines

Web Mining Team 11 Professor Anita Wasilewska CSE 634 : Data Mining Concepts and Techniques

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Lec 8: Adaptive Information Retrieval 2

CHAPTER-27 Mining the World Wide Web

Link Analysis. Hongning Wang

Extracting Information from Complex Networks

Bruno Martins. 1 st Semester 2012/2013

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Web Structure Mining using Link Analysis Algorithms

Approaches to Mining the Web

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Information Retrieval Spring Web retrieval

Survey on Web Structure Mining

Mining Web Data. Lijun Zhang

COMP 4601 Hubs and Authorities

A P2P-based Incremental Web Ranking Algorithm

TODAY S LECTURE HYPERTEXT AND

Improving the Ranking Capability of the Hyperlink Based Search Engines Using Heuristic Approach

Searching the Web [Arasu 01]

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

The application of Randomized HITS algorithm in the fund trading network

Searching the Web What is this Page Known for? Luis De Alba

AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM

Social Information Filtering

5/30/2014. Acknowledgement. In this segment: Search Engine Architecture. Collecting Text. System Architecture. Web Information Retrieval

Part 1: Link Analysis & Page Rank

Graph-Based Algorithms in NLP

Mining Web using Hyper Induced Topic Search Algorithm

Analytical survey of Web Page Rank Algorithm

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

Motivation. Motivation

ABSTRACT. The purpose of this project was to improve the Hypertext-Induced Topic

Ranking of nodes of networks taking into account the power function of its weight of connections

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

Review of Various Web Page Ranking Algorithms in Web Structure Mining

arxiv:cs/ v1 [cs.ir] 26 Apr 2002

Chapter 3: Google Penguin, Panda, & Hummingbird

Purna Prasad Mutyala et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (5), 2011,

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google

Joint Entity Resolution

Link Structure Analysis

Experimental study of Web Page Ranking Algorithms

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

A ew Algorithm for Community Identification in Linked Data

A Study on Web Structure Mining

Information Retrieval May 15. Web retrieval

ECLT 5810 Clustering

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

Optimizing Search Engines using Click-through Data

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Predict Topic Trend in Blogosphere

TEA September, 2003 Universidad Autónoma de Madrid. Finding Hubs for Personalised Web Search. Different ranks to different users

ECLT 5810 Clustering

Review: Searching the Web [Arasu 2001]

Finding Authoritative Web Pages Using the Web's Link Structure (David Wagner)

Web Document Clustering using Semantic Link Analysis

Weighted PageRank using the Rank Improvement

DATA MINING - 1DL105, 1DL111

What is this Page Known for? Computing Web Page Reputations. Davood Rafiei, Alberto Mendelzon University of Toronto

D. BOONE/CORBIS. Structural Web Search Using a Graph-Based Discovery System. 20 Spring 2001 intelligence

Finding Hubs and authorities using Information scent to improve the Information Retrieval precision

Web. The Discovery Method of Multiple Web Communities with Markov Cluster Algorithm

Dynamic Visualization of Hubs and Authorities during Web Search

Transcription:

Authoritative Sources in a Hyperlinked Environment Journal of the ACM 46(1999) Jon Kleinberg, Dept. of Computer Science, Cornell University

Introduction Searching on the web is defined as the process of discovering pages relevant to the query Relevance is subjective. Quality metrics would require human intervention Objective functions that concretely define human notions of quality are missing

Authoritative Sources Query type and issues Type Example Issues Specific-topic Who is John Galt? Scarcity problem Broad-topic Works of American authors Abundance problem Similar-page Find similar pages Defining similarity For broad-topic queries, effective search methods need to identify authorative sources Identification of authoritative sources is difficult No apparent endogenous attribute of the page that enables it to be identified as an authority. Natural authorities may not use self-identifying terms on their pages.

A link based approach Paper proposes a link-based model for the conferral of authority. The model is based on the relationship that exists between authorities and hubs - pages linking to authorities The algorithm operates on focused subgraphs of the web producing a small collection of pages that are most likely to contain the authorative sources for a given topic.

Constructing a focused subgraph Graph G(V, E) where V Set of pages A directed edge (p, q) E indicates presence of a link from p to q Out degree(p): No. of nodes that p links to In degree(p) No of nodes that link to p An induced graph on W V is a graph G(W, E ) such that E E and (p, q) E, p W and q W Our goal is to identify a collection of pages S σ with the following traits:- 1 S σ is relatively small 2 S σ is rich in relevant pages 3 S σ contains most of the strongest authorities.

Constructing a focused subgraph (cont.) Collect the t-top ranked pages from a search engine that returns text results. The set of pages is referred to as the root set R σ Although the strong authority may not exist in R σ, it is very likely to be linked to by at least one of the pages in the set R σ The number of strong authorities in the subgraph can therefore be increased by expanding R σ along the edges. The algorithm works in as follows: Set S σ = R σ For each p R σ : Add all pages outgoing from p to S σ Add an arbitrary set of pages d a from incoming links to p to S σ The result is a focussed subgraph typically of the size between 1000-5000 a Limited to a subset since the in degree of p might be very large. In their experiments d was set to 50

Computing hubs and spokes Preprocessing Domain: The first level in the url string Types of links:- Transverse: A link between different domains Intrinsic: A link between pages in the same domain. Intrinsic links are assumed typically navigational and hence contribute less information about authorities. All intrinsic links from G[S σ ] are removed, keeping only the edges corresponding to transverse links. In addition, to account for mass advertisements / collusion among referring pages, that allows only up to m pages from a single domain to point to a any given page p. This heuristic is not used in the experiment. The result graph is denoted by G σ

Computing hubs and spokes Authorative pages would have:- A high in-degree: Lots of pages linking to it A considerable overlap in the set of pages that point to it. These pages are called hubs Hubs are pages pointing to multiple relevant authorative pages. Hubs and authorities exhibit mutually reinforcing relationship - a good hub is a page that points to many authoritative pages and a good authority is one that is pointed to by many good hubs.

Computing hubs and spokes An iterative algorithm to assigning weights to pages Associated with a page p are:- A non-negative authority weight x <p> A non-negative hub weight y <p> The weights are normalized so that p S σ (x <p>2 ) = 1 and p S σ (y <p>2 ) = 1 Two operations on the weights: I: Updates x-weights as x <p> q:(q,p) E y <q> O: Updates y-weights as y <p> q:(q,p) E x <q>

Computing hubs and spokes An iterative algorithm to assigning weights to pages Iterate(G, k): G: A collection of n linked pages k: A constant x 0 = y 0 = 1, 1, 1...1 For i = 1...k Apply I operations to (x i 1, y i 1 ) obtaining new weights x i Apply O operations to (x i, y i 1) obtaining new weights y i Normalize x i and y i to obtain x i and y i respectively End Return (x k, y k )

Computing hubs and spokes Obtaining the top c authorities and hubs Filter(G, k, c) G: A collection of n linked pages k, c: natural numbers (x k, y k ) = Iterate(G, k) The pages with the c largest co-ordinates from x k and y k are the top authorities and hubs. c 5-10 The Iterate procedure converges as k increases arbitrarily. In their experiments k = 20 provided sufficient convergence.

Basic Results Interpretation of the results Besides the initial black-box call to the search engine, the analysis ignored the textual content of the pages. This indicates a considerable amount can be accomplished using a pure link analysis approach. It is possible to reliably estimate certain types of global information about the www using a local method of analysis on a small focussed subgraph. By modifying the root set to begin with R p to constitute t pages that point to p, the algorithm can be made to perform similar-page queries.

Brushing up on some linear algebra Eigen Vectors and Eigen Values An eigen vector is a column vector X [n,1] such that M [n,n] X [n,1] = λx [n,1] e.g. For M = ( 2 0 1 1 ) ( 2 0 1 1 ) 1 0 ) = 2 ( 1 0 ) ( 2 0 1 1 ) ( ) ( 1 1 = 1 1 ) 1 The scalars λ = {2, 1} are called eigen values If the eigen values are ordered on decreasing eigen values i.e. λ i, then the eigen vector corresponding to λ 0 is called the principal eigen vector and the rest are called (surprise!) non-principal eigen vector

Multiple Sets of Hubs & Authorities A number of scenarios where one may be interested in finding several densely linked collections of hubs and authorities. Each collection could be relevant but disconnected from each other:- Ambigious / several meanings: e.g. Java - software platform, island, coffee Highly polarized issues (pages that do not cross link): e.g. abortion Queries that may encompass multiple communities e.g. randomized algorithms Multiple sets of hubs & authorities can be identified by considering non-principal eigen vectors to identify additional densely linked collections of hubs and authorities from the base set S σ

Standings, Impact and Influence Social Networks & Bibliometrics Relative standing of individuals in an implicitly defined social network can be measured by methods similar to the ones discussed for pages so far. Bilbliometrics is the study of written documents and their citation structure. It is considered an important measure of the importance and impact of individual scientific papers. Both these are analogous to the concept of authorities. The problems oculd be modeled as a graph G = (V, E) where an edge (i, j) implies an endorsement of j by i The critical difference between these and link analysis is that in these scenarios, authorities would tend to endore other authorities (e.g. important research papers citing other important papers)

Diffusion & generalization When the initial query string σ specifies a query that is not broad, there will not be enough relevant pages in G σ from which to extract a subgraph of authorities and hubs. Authorative pages competing with broader topics will compete and win out in the algorithm. This is termed as diffusing from the initial query Although, diffusion produces incorrect results for such a class of queries, they are useful to identify a natural generalization of a given topic. This problem is alleviated by the use of textual context in addition to the link structure and was is part of the paper s future work.

Evaluation Yahoo! versus Clever versus AltaVista Clever is an implementation of the link analysis algorithm implemented independently. It improvises on the algorithm by using contextual link text information as additional heuristics. At the time the evaluation was performed, Yahoo! was a hierarchical manually maintained directory and provided the best measure of human judgement of authority. Users were asked to compare the results provided by the three yielding 1369 responses in all Yahoo! and Clever Equivalent 31% Clever better 50% Yahoo! better 19%