What is this Page Known for? Computing Web Page Reputations. Davood Rafiei, Alberto Mendelzon University of Toronto

Similar documents
What is this Page Known for? Computing Web Page. Abstract. The textual content of the Web enriched with the hyperlink structure surrounding

Data (and Links) on the Web. Alberto Mendelzon University of Toronto

What do the Neighbours Think? Computing Web Page Reputations

Information Retrieval. Lecture 4: Search engines and linkage algorithms

Searching the Web What is this Page Known for? Luis De Alba

= a hypertext system which is accessible via internet

COMP5331: Knowledge Discovery and Data Mining

Searching the Web [Arasu 01]

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Graph and Link Mining

COMP 4601 Hubs and Authorities

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Lecture 8: Linkage algorithms and web search

Information Retrieval and Web Search

Lecture #3: PageRank Algorithm The Mathematics of Google Search

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Link Structure Analysis

Authoritative Sources in a Hyperlinked Environment

How to organize the Web?

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Social Networks 2015 Lecture 10: The structure of the web and link analysis

Lec 8: Adaptive Information Retrieval 2

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Brief (non-technical) history

1.6 Case Study: Random Surfer

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

CS6200 Information Retreival. The WebGraph. July 13, 2015

Introduction to Information Retrieval

SOFIA: Social Filtering for Niche Markets

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

On Finding Power Method in Spreading Activation Search

Abstract. 1. Introduction

Recent Researches on Web Page Ranking

An Improved Computation of the PageRank Algorithm 1

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang

Bruno Martins. 1 st Semester 2012/2013

Bibliometrics: Citation Analysis

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

A P2P-based Incremental Web Ranking Algorithm

CS60092: Informa0on Retrieval

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy

Absorbing Random walks Coverage

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

Breadth-First Search Crawling Yields High-Quality Pages

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google

ANNA UNIVERSITY :: CHENNAI - 25 TIME TABLE M.C.A. (DISTANCE EDUCATION) DEGREE EXAMINATIONS AUGUST - SEPTEMBER

Social Network Analysis

Graphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

CHAPTER-27 Mining the World Wide Web

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

Survey on Web Structure Mining

Introduction to Data Mining

TODAY S LECTURE HYPERTEXT AND

CS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science

QUERY BASED TEXT SUMMARIZATION

Impact of Search Engines on Page Popularity

2013/2/12 EVOLVING GRAPH. Bahman Bahmani(Stanford) Ravi Kumar(Google) Mohammad Mahdian(Google) Eli Upfal(Brown) Yanzhao Yang

Social and Technological Network Data Analytics. Lecture 5: Structure of the Web, Search and Power Laws. Prof Cecilia Mascolo

Information Retrieval and Web Search Engines

Link Analysis SEEM5680. Taken from Introduction to Information Retrieval by C. Manning, P. Raghavan, and H. Schutze, Cambridge University Press.

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

An Enhanced Page Ranking Algorithm Based on Weights and Third level Ranking of the Webpages

Absorbing Random walks Coverage

Part 1: Link Analysis & Page Rank

Motivation. Motivation

COMP Page Rank

AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM

Effective Page Refresh Policies for Web Crawlers

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

Lecture 8: Linkage algorithms and web search

HIPRank: Ranking Nodes by Influence Propagation based on authority and hub

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

MAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Information retrieval

Link Analysis. Hongning Wang

Authoritative Sources in a Hyperlinked Environment

A novel approach of web search based on community wisdom

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

Link Based Clustering of Web Search Results

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

Introduction to Machine-Independent Optimizations - 4

Outline. Blended Services Training Session 8: The Web. January 14, Outcomes & competencies. What s on the web.

Blended Services Training Session 8: The Web. January 14, 2010

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

Mining The Web. Anwar Alhenshiri (PhD)

SOURCERER: MINING AND SEARCHING INTERNET- SCALE SOFTWARE REPOSITORIES

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis

Weighted PageRank using the Rank Improvement

Multimedia Content Management: Link Analysis. Ralf Moeller Hamburg Univ. of Technology

Transcription:

What is this Page Known for? Computing Web Page Reputations Davood Rafiei, Alberto Mendelzon University of Toronto 1

Introduction Ranking plays an important role in searching the Web. But the importance is a subjective measure. A high quality page in computer graphics is not necessarily a high quality page in databases. How do search engines address this problem? 2

Simple Importance Ranking u v y x Rank by in-degree: used in citation analysis (1970s). idea: important journals are frequently cited by other journals. 3

Importance Ranking: PageRank The rank of a page depends on not only the number of its incoming links, but also the ranks of those pages. Adopted by Google search engine. high-ranked pages are returned first. Limitation: each page is assigned a universal rank, independent of its topic. 4

Our Goal Topic Search Engine Pages Page Our System Topics 5

Example What is the page sunsite.unc.edu/javafaq/javafaq.html good for? Java FAQ comp.lang.java FAQ Java Tutorials Java Stuff 6

The Idea search engines compared my favorite search engines p a review of search engines What can we say about the content of Page p? 7

Random Walk Model 1 Imagine a user searching for pages on topic t. The user at each step either jumps into a page on topic t chosen uniformly at random or follows an outgoing link of the current page. The one-level rank of a page on topic t is the number of visits the user makes into the page if the walk goes forever. 8

Random Walk Model 1 d : the fraction of times the user makes a random jump. (1-d) : the fraction of times the user follows a link. N t R n : number of pages on topic t ( p, t) : Prob. of visiting page p for topic t at step n. 9

10 Probability of Visiting a Page p + = otherwise 0 on topict pagep is if ) ( ), ( ) (1 ), ( 1 t p q n n N d O q t q R d t p R q

Second Scenario Good source of links (hub) search engines compared my favorite search engines a review of search engines Good content (authority) p 11

Random Walk Model 2 Imagine the user at each step either jumps into a page on topic t chosen uniformly at random, follows an outgoing link of the current page (forward visit), or jumps into a page that points to the current page (backward visit). The walk strictly alternates between steps 2, 3. The number of forward (backward) visits the user makes into a page is its authority (hub) rank on topic t if the walk goes forever. 12

Random Walk Model 2 d, (1-d), N t : defined similarly. A n ( p, t) : Prob. of a forward visit into page p at step n. H n ( p, t) : Prob. of a backward visit into page p at step n. 13

14 Probability of Visiting a Page + = otherwise 0 on topict pagep is if 2 ) ( ), ( ) (1 ), ( 1 t p q n n N d O q t q H d t p A + = otherwise 0 on topict pagep is if 2 ) ( ), ( ) (1 ), ( 1 t q p n n N d q I t q A d t p H

Rank Computation Done using iterative methods. First iteration: Topics are extracted from the content of pages, Ranks are initialized. Next iterations: Ranks are propagated through hyperlinks. 15

Rank Approximation A given page p can acquire a high rank on an arbitrarily chosen topic t if page p is on topic t, p can be reached within a few steps from a large fraction of pages on topic t, or p can be reached within a few steps from pages with high reputations on topic t. An approximate algorithm will examine page p and only those pages not far away from page p. 16

Computing One-Level Reputation For every page p and term t R(p,t) = 1/ N if term t appears in page p, t R(p,t) = 0 otherwise While R has not converged R1(p,t) = 0 for every page p and term t For every link qp R1(p,t) += R(q,t) / O(q) R(p,t) = (1-d) R1(p,t) for every page p and term t R(p,t) += d/ if term t appears in page p. N t 17

Computing Two-level Reputation For every page p and term t A(p,t) = H(p,t) = 1/2 N t if term t appears in page p, A(p,t) = H(p,t) = 0 otherwise While both H and A have not converged A1(p,t) = H1(p,t) = 0 for every page p and term t For every link qp A1(p,t) += H(q,t) / O(q) H1(q,t) += A(p,t) / I(p) A(p,t) = (1-d) A1(p,t) and H(p,t) = (1-d) H1(p,t) for every page p and term t A(p,t) += d/2 N t and H(p,t) += d/2n t 18 if term t appears in page p.

Current Implementation Given a page, request its incoming links from Alta Vista. Collect the snippets returned by the engine and extract candidate terms and phrases. Remove stop words. Set O(p) = 7.2 for every page p. Initialize the weights and propagate them within one iteration. Return highly-weighted terms/phrases. 19

Example Reputation of www.macleans.ca : 1 - Maclean's Magazine 2 - macleans 3 - Canadian Universities 20

Example: Authorities on (+censorship +net) www.eff.org Anti-Censorship, Join the Blue Ribbon, Blue Ribbon Campaign, Electronic Frontier Foundation www.cdt.org Center for Democracy and Technology, Communications Decency Act, Censorship, Free Speech, Blue Ribbon www.aclu.org ACLU, American Civil Liberties Union, Communications Decency Act 21

Example: Personal Home Pages www.w3.org/people/berners-lee History Of The Internet, Tim Berners-Lee, Internet History, W3C www-db.stanford.edu/~ullman Jeffrey D Ullman, Database Systems, Data Mining, Programming Languages www.cs.toronto.edu/~mendel Alberto Mendelzon, Data Warehousing and OLAP, SIGMOD, DBMS 22

Example: Site Reputation What is this site known for? Russia Computer Vision Images Hockey 23

Example: Site Reputation Reputation of the Faculty of Mathematics, Computer Science, Physics and Astronomy at the University of Amsterdam ( www.wins.uva.nl ): Solaris 2 FAQ Wiskunde Frank Zappa 24

Limitations Our computations are affected by the following two factors: how well is a topic represented on the Web? how well is a page connected? a few pages such as www.microsoft.com have links from a large fraction of all pages on the Web. a large number of pages only have a few incoming links. 25

Conclusions Introduced a notion of reputation combining the textual content and the linkage structure. Duality of Topics and Pages Given a page, we currently find a ranked list of topics for the page. However, given a topic, we can also find a ranked list of pages on that topic. 26

Conclusions Our proposed methods generalize earlier ranking methods One-level reputation ranking generalizes PageRank, Two-level reputation ranking generalizes the hubs-and-authorities model. Ongoing Work: large-scale implementation of the proposed methods. 27