Bruno Martins. 1 st Semester 2012/2013

Similar documents
ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

Link Analysis and Web Search

Information Networks: PageRank

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Part 1: Link Analysis & Page Rank

How to organize the Web?

Clustering. Bruno Martins. 1 st Semester 2012/2013

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Information Retrieval and Web Search Engines

Information Retrieval. Lecture 11 - Link analysis

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

COMP 4601 Hubs and Authorities

Mining Web Data. Lijun Zhang

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Recent Researches on Web Page Ranking

Mining Web Data. Lijun Zhang

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Classification. 1 o Semestre 2007/2008

Lecture 8: Linkage algorithms and web search

Information Retrieval and Web Search

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

DATA MINING - 1DL105, 1DL111

CS6200 Information Retreival. The WebGraph. July 13, 2015

Big Data Analytics CSCI 4030

Link Analysis. Hongning Wang

COMP5331: Knowledge Discovery and Data Mining

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

Searching the Web [Arasu 01]

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Social Networks 2015 Lecture 10: The structure of the web and link analysis

Web Structure Mining using Link Analysis Algorithms

Introduction to Data Mining

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Slides based on those in:

Generalized Social Networks. Social Networks and Ranking. How use links to improve information search? Hypertext

Lecture 8: Linkage algorithms and web search

Brief (non-technical) history

DSCI 575: Advanced Machine Learning. PageRank Winter 2018

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Graph and Web Mining - Motivation, Applications and Algorithms PROF. EHUD GUDES DEPARTMENT OF COMPUTER SCIENCE BEN-GURION UNIVERSITY, ISRAEL

Collaborative filtering based on a random walk model on a graph

CENTRALITIES. Carlo PICCARDI. DEIB - Department of Electronics, Information and Bioengineering Politecnico di Milano, Italy

Visualization and text mining of patent and non-patent data

Learning to Rank Networked Entities

CS425: Algorithms for Web Scale Data

Introduction to Information Retrieval

CSI 445/660 Part 10 (Link Analysis and Web Search)

Searching the Web What is this Page Known for? Luis De Alba

DATA MINING II - 1DL460. Spring 2014"

An Improved Computation of the PageRank Algorithm 1

Final Exam Search Engines ( / ) December 8, 2014

Introduction to Data Mining

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW

Bibliometrics: Citation Analysis

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

The k-means Algorithm and Genetic Algorithm

Big Data Analytics CSCI 4030

DATA MINING - 1DL460

World Wide Web has specific challenges and opportunities

Link analysis in web IR CE-324: Modern Information Retrieval Sharif University of Technology

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Link Structure Analysis

TODAY S LECTURE HYPERTEXT AND

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Database System Concepts

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis

Abstract. 1. Introduction

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

Database System Concepts

Information Retrieval. Lecture 4: Search engines and linkage algorithms

Unit VIII. Chapter 9. Link Analysis

Chapter 6: Information Retrieval and Web Search. An introduction

Analysis of Large Graphs: TrustRank and WebSpam

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

Information Retrieval

Social Network Analysis

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy

Unsupervised Learning

Focused crawling: a new approach to topic-specific Web resource discovery. Authors

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

CPSC 340: Machine Learning and Data Mining. Ranking Fall 2016

DATA MINING - 1DL460

PageRank and related algorithms

Part I: Data Mining Foundations

Ranking on Data Manifolds

Finding Top UI/UX Design Talent on Adobe Behance

On Finding Power Method in Spreading Activation Search

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

Transcription:

Link Analysis Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti.

Outline 1 2 3 4

Outline 1 2 3 4

Traditional IR Traditional IR systems: Worth of a document regarding a query is intrinsic to the document. Documents are self-contained units Documents are descriptive and truthful

Web IR The World Wide Web is a shifting universe Indefinitely growing Non-textual content Invisible keywords Documents are not self-complete Most web queries 2 words long Hyperlinked

The Web as a Graph Web as a hyperlink graph Evolves organically, No central coordination, Yet shows global and local properties An example of social network

Graph structure of the Web

Graph structure of the Web

Network Analysis Studies properties related to connectivity and distances in graphs Example applications: Epidemiology Identifying a few nodes to be removed to significantly increase average path length between pairs of nodes Citation analysis Identifying influential or central papers

The Web as a Network Hypermedia is a social network network theory Extensive research applying graph notions Centrality and prestige Co-citation (relevance judgment) Applications Web search:, Google Classification and topic distillation

Exploiting the link structure Ranking search results Keyword queries not selective enough Use graph notions of popularity/prestige and Supervised and unsupervised learning Hyperlinks and content are strongly correlated Can be used to classify documents Can be used to cluster documents

Outline 1 2 3 4

Document citation graph Node adjacency matrix E E[i, j] = 1 iff document i cites document j, and zero otherwise. Prestige vector over all nodes: p Prestige p[v] associated with every node v

Centrality Graph-based notions of centrality Distance d(u,v) = number of links between u and v Radius of node u is Center of the graph is r(u) = max d(u,v) v center = argmin r(u) u Example: Influential papers in an area of research by looking for papers u with small r(u)

Co-citation Documents v and w are said to be co-cited by u if document u cites documents v and w If E is the document citation matrix E T E is the co-citation index matrix Indicator of relatedness between every v and w Example use: clustering Using above pair-wise relatedness measure in a clustering algorithm

Example clustering

Link-based Ranking Strategies Goal: Leverage the abundance problems inherent in broad queries Google s ing Measure of prestige with every page on web : Hyperlink Induced Topic Search Use query to select a sub-graph from the Web. Identify hubs and authorities in the sub-graph

Link Model Each page is a node without any textual properties Each hyperlink is an edge connecting two nodes with possibly only a positive edge weight property Some preprocessing procedure outside the scope of the algorithm may be used to choose what sub-graph of the Web to analyze, in response to a query

Outline 1 2 3 4

Overview of Pre-computes a rank-vector Provides a-priori (offline) importance estimates for all pages on Web Independent of search query In-degree prestige Not all votes are worth the same Prestige of a page is the sum of prestige of citing pages: p = Ep Pre-compute query independent prestige score Query time: prestige scores used in conjunction with query-specific IR scores

Assumption: the prestige of a page is proportional to the sum of the prestige scores of pages linking to it Idea of a random surfer on a strongly connected web graph

The algorithm: E is adjacency matrix of the Web { 1 iff there is a link from u to v E[u, v] = 0 otherwise The out-degree of node u is given by N u = v E[u, v] Start with an initial prestige vector p 0 [u] Compute p i+1 [v] = (u,v) E p i [u] N u

Convergence Convergence to stationary distribution of the normalized adjacency matrix L p is principal eigenvector of L Convergence criteria L is irreducible there is a directed path from every node to every other node L is aperiodic if there is no integer k > 1 that divides the length of every cycle of the graph

Problems of Convergence Web graph is not strongly connected Only a fourth of the graph is! Web graph is not aperiodic Rank-sinks Pages without out-links Directed cyclic paths

A simple fix Two way choice at each node With a certain probability d (0.1 < d < 0.2), the surfer jumps to a random page on the Web With probability 1 d the surfer decides to choose, uniformly at random, an out-neighbor p i+1 [v] = d N + (1 d) (u,v) E p i [u] N u

architecture at Google Ranking of pages more important than exact values of p Convergence of page ranks in 52 iterations for a crawl with 322 million links. Pre-compute and store the of each page. independent of any query or textual content. Ranking scheme combines with textual match Unpublished Many empirical parameters, human effort and regression testing.

Outline 1 2 3 4

: Hypertext Induced Topic Selection Relies on query-time processing To select base set V q of links for query q constructed by selecting a sub-graph R from the Web (root set) relevant to the query selecting any node u which neighbors any r R via an inbound or outbound edge (expanded set) To deduce hubs and authorities that exist in a sub-graph of the Web Every page u has two distinct measures of merit, its hub score h[u] its authority score a[u] Recursive quantitative definitions of hub and authority scores

Hubs and Authorities Hub A page is a good hub if it contains links to many good authority pages Authority A page is a good authority if it is pointed to by many good hubs Authority pages provide good content. Hub pages provide links to the pages with good content.

The Algorithm

vs. advantage over Query-time cost is low computes an eigenvector for every query Less susceptible to localized link-spam advantage over ranking is sensitive to query has notion of hubs and authorities

Questions?