Navigating the Web graph

Workshop on Networks and Navigation, Santa Fe Institute, August 2008
Filippo Menczer, Informatics & Computer Science, Indiana University, Bloomington

Outline
Topical locality: content, link, and semantic topologies
Implications for growth models and navigation
Applications
    Topical Web crawlers
    Distributed collaborative peer search

The Web as a text corpus
Pages close in word vector space tend to be related
Cluster hypothesis (van Rijsbergen 1979)
The WebCrawler (Pinkerton 1994)
The whole first generation of search engines
[Figure: pages p1 and p2 in a word vector space with axes "weapons", "mass", "destruction"]

Enter the Web's link structure
$p(i) = \frac{\alpha}{N} + (1 - \alpha) \sum_{j : j \to i} \frac{p(j)}{|\{l : j \to l\}|}$
Brin & Page 1998
Barabasi & Albert 1999
Broder & al. 2000
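The PageRank recurrence on this slide can be iterated directly. The following is a minimal sketch, assuming a toy four-page graph, a jump probability α = 0.15, and uniform redistribution of rank from pages without out-links; none of these choices come from the slides, and the function name `pagerank` is illustrative.

```python
def pagerank(out_links, alpha=0.15, iterations=100):
    """Iterate p(i) = alpha/N + (1 - alpha) * sum over j->i of p(j) / outdeg(j)."""
    nodes = sorted(out_links)
    n = len(nodes)
    p = {i: 1.0 / n for i in nodes}          # start from the uniform distribution
    for _ in range(iterations):
        nxt = {i: alpha / n for i in nodes}  # random-jump term alpha/N
        for j in nodes:
            targets = out_links[j]
            if targets:
                share = (1 - alpha) * p[j] / len(targets)
                for i in targets:
                    nxt[i] += share
            else:
                # Assumption: a page with no out-links spreads its rank uniformly.
                for i in nodes:
                    nxt[i] += (1 - alpha) * p[j] / n
        p = nxt
    return p


# Hypothetical toy graph: page -> list of pages it links to.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy graph page "d" has no in-links, so its rank stays at the jump term α/N, while "c", which collects links from three pages, ends up ranked highest.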

Three network topologies: Text, Meaning, Links

The link-cluster conjecture
Connection between semantic topology (topicality or relevance) and link topology (hypertext)
$G = \Pr[rel(p)]$ ~ fraction of relevant pages (generality)
$R = \Pr[rel(p) \mid rel(q) \wedge link(q,p)]$
Related nodes are clustered if $R > G$ (modularity)
Necessary and sufficient condition for a random crawler to find pages related to start points
[Figure: example graph with G = 5/15, C = 2, R = 3/6 = 2/4]
ICML 1997

Link-cluster conjecture
Stationary hit rate for a random crawler:
$\eta(t+1) = \eta(t) R + (1 - \eta(t)) G$
$\eta(t) \to \eta^* = \frac{G}{1 - (R - G)}$ as $t \to \infty$
Conjecture: $\eta^* > G \iff R > G$
Value added: $\frac{\eta^*}{G} - 1 = \frac{R - G}{1 - (R - G)}$
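The fixed point of the hit-rate recurrence can be checked numerically. This is a small sketch that reuses the example values G = 5/15 and R = 3/6 from the preceding slide; the function names and the starting value η(0) = 0 are assumptions made for illustration.

```python
def stationary_hit_rate(G, R):
    """Closed form eta* = G / (1 - (R - G)), valid when R - G < 1."""
    return G / (1.0 - (R - G))


def simulate_hit_rate(G, R, eta0=0.0, steps=50):
    """Iterate eta(t+1) = eta(t) * R + (1 - eta(t)) * G from eta0."""
    eta = eta0
    for _ in range(steps):
        eta = eta * R + (1.0 - eta) * G
    return eta


G, R = 5 / 15, 3 / 6
eta_star = stationary_hit_rate(G, R)      # 0.4 for these values
eta_sim = simulate_hit_rate(G, R)         # converges to the same value
value_added = eta_star / G - 1.0          # equals (R - G) / (1 - (R - G)) = 0.2
```

With these numbers a random crawler's stationary hit rate is 0.4, a 20% gain over the baseline generality G of one third.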

Link-cluster conjecture
Pages that link to each other tend to be related
Preservation of semantics (meaning), a.k.a. topic drift
$\frac{R(q,\delta)}{G(q)} \equiv \frac{\Pr[rel(p) \mid rel(q) \wedge path(q,p) \le \delta]}{\Pr[rel(p)]}$
$L(q,\delta) \equiv \frac{\sum_{\{p : path(q,p) \le \delta\}} path(q,p)}{|\{p : path(q,p) \le \delta\}|}$
JASIST 2004

The link-content conjecture
Correlation of lexical and linkage topology
$S(q,\delta) \equiv \frac{\sum_{\{p : path(q,p) \le \delta\}} sim(q,p)}{|\{p : path(q,p) \le \delta\}|}$
$L(\delta)$: average link distance
$S(\delta)$: average similarity to start (topic) page from pages up to distance $\delta$
Correlation $\rho(L,S) = 0.76$

Heterogeneity of link-content correlation
$S = c + (1 - c) e^{a L^b}$
[Figure: fitted parameters by domain (edu, net, gov, org, com); legend: signif. diff. a only (p<0.05); signif. diff. a & b (p<0.05)]

Mapping the relationship between links, content, and semantic topologies
Given any pair of pages, need similarity or proximity metric for each topology:
    Content: textual/lexical (cosine) similarity
    Link: co-citation/bibliographic coupling
    Semantic: relatedness inferred from manual classification
Data: Open Directory Project (dmoz.org)
    ~ 1 M pages after cleanup
    ~ $1.3 \times 10^{12}$ page pairs!

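Content similarity
$\sigma_c(p_1, p_2) = \frac{p_1 \cdot p_2}{\|p_1\| \, \|p_2\|}$
[Figure: page vectors p1 and p2 in term space (term i, term j, term k)]
Link similarity
$\sigma_l(p_1, p_2) = \frac{|U_{p_1} \cap U_{p_2}|}{|U_{p_1} \cup U_{p_2}|}$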
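Both pairwise measures are short to compute. The sketch below assumes raw term counts from whitespace tokenization for the page vectors (no TF-IDF weighting) and treats U_p as whatever set of neighboring URLs the caller supplies; the slides do not specify either detail, and the function names are illustrative.

```python
import math
from collections import Counter


def content_similarity(text1, text2):
    """Cosine similarity between term-count vectors of two page texts."""
    v1 = Counter(text1.lower().split())
    v2 = Counter(text2.lower().split())
    dot = sum(v1[term] * v2[term] for term in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def link_similarity(urls1, urls2):
    """Jaccard coefficient |U1 & U2| / |U1 | U2| of two URL neighborhoods."""
    u1, u2 = set(urls1), set(urls2)
    union = u1 | u2
    return len(u1 & u2) / len(union) if union else 0.0


same_text = content_similarity("weapons of mass destruction",
                               "weapons of mass destruction")
disjoint_text = content_similarity("mass destruction", "web crawler")
shared_links = link_similarity({"a", "b", "c"}, {"b", "c", "d"})  # 2 shared of 4
```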

Semantic similarity
Information-theoretic measure based on classification tree (Lin 1998)
$\sigma_s(c_1, c_2) = \frac{2 \log \Pr[lca(c_1, c_2)]}{\log \Pr[c_1] + \log \Pr[c_2]}$
Classic path distance in special case of balanced tree
[Figure: classification tree with root "top", topics c1 and c2, and their lowest common ancestor lca]
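The tree-based measure can be illustrated on a tiny taxonomy. In this sketch the topic names and page counts are invented, and Pr[c] is estimated as the fraction of pages classified in the subtree rooted at c, which is one common reading of the formula rather than a detail confirmed by the slide.

```python
import math

# Hypothetical taxonomy (child -> parent) and pages filed directly under each topic.
PARENT = {
    "top": None,
    "Shopping": "top",
    "Sports": "top",
    "Shopping/Cycling": "Shopping",
    "Shopping/Clothing": "Shopping",
    "Sports/Cycling": "Sports",
}
PAGES = {"top": 0, "Shopping": 10, "Sports": 10,
         "Shopping/Cycling": 20, "Shopping/Clothing": 40, "Sports/Cycling": 20}


def ancestors(topic):
    """Path from a topic up to the root, including the topic itself."""
    chain = []
    while topic is not None:
        chain.append(topic)
        topic = PARENT[topic]
    return chain


def lca(c1, c2):
    """Lowest common ancestor of two topics."""
    seen = set(ancestors(c1))
    for node in ancestors(c2):
        if node in seen:
            return node


def prob(topic):
    """Fraction of all pages classified at or below the topic."""
    in_subtree = sum(n for t, n in PAGES.items() if topic in ancestors(t))
    return in_subtree / sum(PAGES.values())


def semantic_similarity(c1, c2):
    """sigma_s = 2 log Pr[lca] / (log Pr[c1] + log Pr[c2])."""
    if c1 == c2:
        return 1.0
    return 2 * math.log(prob(lca(c1, c2))) / (math.log(prob(c1)) + math.log(prob(c2)))


siblings = semantic_similarity("Shopping/Cycling", "Shopping/Clothing")
cross_branch = semantic_similarity("Shopping/Cycling", "Sports/Cycling")
```

Topics whose only common ancestor is the root get similarity 0, because Pr[top] = 1 and its logarithm vanishes; siblings under a narrower parent score strictly between 0 and 1.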

Individual metric distributions
[Figure: distributions of content, semantic, and link similarity]

$Precision = \frac{|Retrieved \cap Relevant|}{|Retrieved|}$
$Recall = \frac{|Retrieved \cap Relevant|}{|Relevant|}$
Averaging semantic similarity:
$P(s_c, s_l) = \frac{\sum_{\{p,q : \sigma_c = s_c, \sigma_l = s_l\}} \sigma_s(p,q)}{|\{p,q : \sigma_c = s_c, \sigma_l = s_l\}|}$
Summing semantic similarity:
$R(s_c, s_l) = \frac{\sum_{\{p,q : \sigma_c = s_c, \sigma_l = s_l\}} \sigma_s(p,q)}{\sum_{\{p,q\}} \sigma_s(p,q)}$
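The two maps can be computed in one pass over the page pairs. The sketch below assumes each pair arrives as a triple (σ_c, σ_l, σ_s) with values in [0, 1] and discretizes the content and link similarities into equal-width bins; the bin count and the sample triples are made up for illustration.

```python
from collections import defaultdict


def precision_recall_maps(pairs, bins=10):
    """Return (precision, recall) dicts keyed by (content bin, link bin)."""
    sums = defaultdict(float)   # summed semantic similarity per cell
    counts = defaultdict(int)   # number of pairs per cell
    total = 0.0                 # semantic similarity summed over all pairs
    for sigma_c, sigma_l, sigma_s in pairs:
        cell = (min(int(sigma_c * bins), bins - 1),
                min(int(sigma_l * bins), bins - 1))
        sums[cell] += sigma_s
        counts[cell] += 1
        total += sigma_s
    precision = {cell: sums[cell] / counts[cell] for cell in sums}
    recall = {cell: sums[cell] / total for cell in sums} if total else {}
    return precision, recall


# Hypothetical pairs: two lexically and link-wise close pairs, one distant pair.
sample = [(0.95, 0.95, 0.8), (0.95, 0.92, 0.6), (0.05, 0.05, 0.1)]
precision, recall = precision_recall_maps(sample)
```

Precision is the average semantic similarity inside a cell, while recall is the share of all semantic similarity that the cell captures, so the recall values sum to 1 over the map.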

[Figures: precision and log recall as functions of σ_c and σ_l, shown for Science, Adult, News, and all pairs]


Link probability vs lexical distance
$r = \frac{1}{\sigma_c} - 1$
$\Pr(\lambda \mid \rho) = \frac{|\{(p,q) : r = \rho \wedge \sigma_l > \lambda\}|}{|\{(p,q) : r = \rho\}|}$
Phase transition at $\rho^*$
Power law tail: $\Pr(\lambda \mid \rho) \sim \rho^{-\alpha(\lambda)}$
Proc. Natl. Acad. Sci. USA 99(22): 14014-14019, 2002

Local content-based growth model
$\Pr(p_t \to p_{i<t}) = \begin{cases} \frac{k(i)}{mt} & \text{if } r(p_i, p_t) < \rho^* \\ c \, [r(p_i, p_t)]^{-\alpha} & \text{otherwise} \end{cases}$
Similar to preferential attachment (BA)
Use degree info (popularity/importance) only for nearby (similar/related) pages

So, many models can predict degree distributions... Which is right? Need an independent observation (other than degree) to validate models Distribution of content similarity across linked pairs

None of these models is right!

The mixture model
Degree-uniform mixture:
$\Pr(i) \propto \psi \frac{1}{t} + (1 - \psi) \frac{k(i)}{mt}$
[Figure: panels a, b, c showing new node t attaching to existing nodes i1, i2, i3]
Bias choice by content similarity instead of uniform distribution

Degree-similarity mixture model
$\Pr(i) \propto \psi \, \hat{\Pr}(i) + (1 - \psi) \frac{k(i)}{mt}$
$\hat{\Pr}(i) \propto [r(i,t)]^{-\alpha}$
$\psi = 0.2$, $\alpha = 1.7$
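A single attachment step of the degree-similarity mixture can be sketched as follows. The parameter values ψ = 0.2 and α = 1.7 are the ones quoted on the slide; normalizing the degree term by the sum of current degrees (standing in for mt) and the example degrees and lexical distances are assumptions for illustration.

```python
import random


def attachment_probabilities(degrees, distances, psi=0.2, alpha=1.7):
    """Mix a similarity-biased choice with preferential attachment.

    degrees[i]   -- current degree k(i) of existing node i
    distances[i] -- lexical distance r(i, t) > 0 from node i to the new node t
    """
    sim_weights = [r ** -alpha for r in distances]   # hat-Pr(i) ~ r^(-alpha)
    sim_total = sum(sim_weights)
    deg_total = sum(degrees)                         # assumption: plays the role of m*t
    return [psi * w / sim_total + (1 - psi) * k / deg_total
            for w, k in zip(sim_weights, degrees)]


def choose_targets(degrees, distances, m=1, seed=0):
    """Sample m link targets for the new node (with replacement, for simplicity)."""
    probs = attachment_probabilities(degrees, distances)
    rng = random.Random(seed)
    return rng.choices(range(len(degrees)), weights=probs, k=m)


# Hypothetical state: three equally connected pages at increasing lexical distance.
probs = attachment_probabilities([3, 3, 3], [0.5, 1.0, 2.0])
targets = choose_targets([3, 3, 3], [0.5, 1.0, 2.0], m=2)
```

With equal degrees the lexically closest page is the most likely target, and setting ψ = 0 recovers pure degree-proportional attachment.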

Both mixture models get the degree distribution right, but the degree-similarity mixture model predicts the similarity distribution better.
Proc. Natl. Acad. Sci. USA 101: 5261-5265, 2004

Citation networks
PNAS: 15,785 articles published in PNAS between 1997 and 2002
[Figures: plots with axes σ_c and σ_l, each ranging from 0 to 1]

Open Questions
Understand distribution of content similarity across all pairs of pages
Growth model to explain co-evolution of both link topology and content similarity
The role of search engines

Efficient crawling algorithms?
Theory: since the Web is a small world network, or has a scale free degree distribution, short paths exist between any two pages:
    ~ log N (Barabasi & Albert 1999)
    ~ log N / log log N (Bollobas 2001)
Practice: can't find them!
    Greedy algorithms based on location in geographical small world networks: ~ poly(N) (Kleinberg 2000)
    Greedy algorithms based on degree in power law networks: ~ N (Adamic, Huberman & al. 2001)

Exception #1: Geographical networks (Kleinberg 2000)
Local links to all lattice neighbors
Long-range link probability distribution: power law $\Pr \sim r^{-\alpha}$
    r: lattice (Manhattan) distance
    α: constant clustering exponent
$t \sim \log^2 N \iff \alpha = D$
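Greedy navigation on such a lattice can be sketched in a few lines. The example assumes a two-dimensional n × n grid (so D = 2), one long-range contact per node drawn with probability proportional to r^(-α), and lazy sampling of that contact when a node is first visited; the grid size, seed, and function names are illustrative choices, not taken from the slides.

```python
import random


def manhattan(a, b):
    """Lattice (Manhattan) distance between two grid points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def long_range_contact(u, n, alpha, rng):
    """Draw one long-range contact v != u with Pr ~ r(u, v)^(-alpha)."""
    others = [(x, y) for x in range(n) for y in range(n) if (x, y) != u]
    weights = [manhattan(u, v) ** -alpha for v in others]
    return rng.choices(others, weights=weights, k=1)[0]


def greedy_route(source, target, n, alpha=2.0, seed=0):
    """Forward greedily to the known contact closest to the target; count hops."""
    rng = random.Random(seed)
    current, steps = source, 0
    while current != target:
        x, y = current
        contacts = [(x + dx, y + dy)
                    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= x + dx < n and 0 <= y + dy < n]
        # Each step strictly reduces the distance, so no node is visited twice
        # and sampling its long-range link on arrival is equivalent to fixing
        # all links in advance.
        contacts.append(long_range_contact(current, n, alpha, rng))
        current = min(contacts, key=lambda v: manhattan(v, target))
        steps += 1
    return steps


hops = greedy_route((0, 0), (19, 19), n=20)
```

Because a lattice neighbor always reduces the distance by one, the hop count never exceeds the Manhattan distance between source and target; long-range links can only shorten the route.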

Is the Web a geographical network?
Replace lattice distance by lexical distance $r = \frac{1}{\sigma_c} - 1$
[Figure: link probability vs. lexical distance, showing local links and long range links (power law tail)]

Exception #2: Hierarchical networks (Kleinberg 2002, Watts & al. 2002)
Nodes are classified at the leaves of tree
Link probability distribution: exponential tail $\Pr \sim e^{-h}$
    h: tree distance (height of lowest common ancestor)
$t \sim \log^{\varepsilon} N$, $\varepsilon \ge 1$
[Figure: tree with levels h=1 and h=2]

Is the Web a hierarchical network?
Replace tree distance by semantic distance $h = 1 - \sigma_s$
[Figure: link probability with exponential tail; classification tree with top, lca, c1, c2]

Take home message: the Web is a friendly place!


Crawler applications
Universal crawlers
    Search engines!
Topical crawlers
    Live search (e.g., myspiders.informatics.indiana.edu)
    Topical search engines & portals
    Business intelligence (find competitors/partners)
    Distributed, collaborative search

Topical [sic] crawlers spears

Evaluating topical crawlers
Goal: build better crawlers to support applications
Build an unbiased evaluation framework
    Define common tasks of measurable difficulty
    Identify topics, relevant targets
    Identify appropriate performance measures
        Effectiveness: quality of crawler pages, order, etc.
        Efficiency: separate CPU & memory of crawler algorithms from bandwidth & common utilities
Information Retrieval 2005

Evaluating topical crawlers: Topics
Automate evaluation using edited directories
Different sources of relevance assessments: keywords, description, targets

Evaluating topical crawlers: Tasks
Start from seeds, find targets and/or pages similar to target descriptions
[Figure: seeds at link distances d=2 and d=3 from targets]

Examples of crawling algorithms
Breadth-First: visit links in order encountered
Best-First: priority queue sorted by similarity
    Variants: explore top N at a time; tag tree context; hub scores
SharkSearch: priority queue sorted by combination of similarity, anchor text, similarity of parent, etc.
InfoSpiders
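A naive best-first crawler reduces to a priority queue over the frontier. The sketch below runs on a hypothetical in-memory "web" instead of fetching pages, and scores each newly discovered link by the cosine similarity between its parent page and the topic; the page texts, URLs, and function names are invented for illustration.

```python
import heapq
import math
from collections import Counter


def cosine(text1, text2):
    """Cosine similarity between term-count vectors."""
    v1, v2 = Counter(text1.split()), Counter(text2.split())
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0


def best_first_crawl(web, seeds, topic, max_pages):
    """Visit pages in order of the topic similarity of the page linking to them."""
    frontier = [(-1.0, url) for url in seeds]   # seeds get the highest priority
    heapq.heapify(frontier)                     # min-heap on negated scores
    seen, visited = set(seeds), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        visited.append(url)
        score = cosine(text, topic)
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited


# Hypothetical toy web: url -> (page text, out-links).
toy_web = {
    "seed": ("web crawler index", ["cars", "spiders"]),
    "cars": ("used cars for sale", ["dealers"]),
    "spiders": ("topical web crawler agents", ["target"]),
    "dealers": ("car dealers", []),
    "target": ("adaptive topical web crawler", []),
}
order = best_first_crawl(toy_web, ["seed"], "topical web crawler", max_pages=4)
```

With a budget of four pages, breadth-first would spend its last fetch on "dealers", whereas the priority queue prefers "target" because its parent page matches the topic.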


Exploration vs. Exploitation
[Figure: average target recall vs. pages crawled]

Co-citation: hub scores
Link score_hub = linear combination of link score and hub score

Recall (159 ODP topics)
[Figure, two panels: average target recall@N (%) vs. N (pages crawled, 0 to 10,000) for Breadth-First, Naive Best-First, DOM, and Hub-Seeker]
Split ODP URLs between seeds and targets
Add 10 best hubs to seeds for 94 topics
ECDL 2003

InfoSpiders
Adaptive distributed algorithm using an evolving population of learning agents
[Figure: each agent carries a local frontier, a keyword vector, and a neural net; agents produce offspring]

Evolutionary Local Selection Algorithm (ELSA)
Foreach agent thread:
    Pick & follow link from local frontier
    Evaluate new links, merge frontier
    Adjust link estimator
    E := E + payoff - cost
    If E < 0:
        Die
    Elsif E > Selection_Threshold:
        Clone offspring
        Split energy with offspring
        Split frontier with offspring
        Mutate offspring
Slide annotations: match resource bias; reinforcement learning; selective query expansion
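The energy bookkeeping of the loop above can be sketched on a toy graph. This is a simplified, single-threaded illustration: links are picked at random instead of by a learned estimator, payoff is a precomputed page relevance, offspring mutation is omitted, and agents with an empty frontier simply idle. All of these simplifications, along with the data and parameter values, are assumptions and not part of the slide.

```python
import random


def elsa_step(population, web, cost, threshold, rng):
    """One pass over all agents: follow a link, update energy, die or clone."""
    next_population = []
    for agent in population:
        if not agent["frontier"]:
            next_population.append(agent)        # assumption: idle, no cost paid
            continue
        url = agent["frontier"].pop(rng.randrange(len(agent["frontier"])))
        relevance, links = web[url]
        new_links = [l for l in links if l not in agent["frontier"]]
        agent["frontier"].extend(new_links)      # merge frontier
        agent["energy"] += relevance - cost      # E := E + payoff - cost
        if agent["energy"] < 0:
            continue                             # die
        if agent["energy"] > threshold:
            half = agent["energy"] / 2           # split energy with offspring
            child = {"energy": half, "frontier": agent["frontier"][1::2]}
            agent["energy"] = half
            agent["frontier"] = agent["frontier"][0::2]   # split frontier
            next_population.append(child)
        next_population.append(agent)
    return next_population


def run_elsa(agents, web, steps, cost=0.1, threshold=1.0, seed=0):
    """Run several steps on copies of the initial agents."""
    rng = random.Random(seed)
    population = [{"energy": a["energy"], "frontier": list(a["frontier"])}
                  for a in agents]
    for _ in range(steps):
        population = elsa_step(population, web, cost, threshold, rng)
    return population


# Hypothetical webs: url -> (relevance payoff, out-links).
barren = {"a": (0.0, ["b"]), "b": (0.0, ["a"])}
rich = {"a": (1.0, ["b", "c"]), "b": (1.0, []), "c": (1.0, [])}
starved = run_elsa([{"energy": 0.15, "frontier": ["a"]}], barren, steps=2)
thriving = run_elsa([{"energy": 0.5, "frontier": ["a"]}], rich, steps=1)
```

Selection here is local: an agent in an irrelevant neighborhood runs out of energy and disappears, while one that finds relevant pages reproduces, which concentrates the population where payoff is available.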

Action selection

Q-learning
Compare estimated relevance of visited document with estimated relevance of link followed from previous page
Teaching input: $E(D) + \mu \max_{l \in D} \lambda_l$
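The teaching signal can be written out as a small function. In this sketch E(D) is the estimated relevance of the visited document, the λ values are the agent's estimates for the links found in D, and μ is a discount factor; the value μ = 0.5, the example numbers, and the function names are assumptions for illustration.

```python
def teaching_input(doc_relevance, link_estimates, mu=0.5):
    """E(D) + mu * max over links l in D of lambda_l (0 if D has no links)."""
    return doc_relevance + mu * max(link_estimates, default=0.0)


def prediction_error(followed_link_estimate, doc_relevance, link_estimates, mu=0.5):
    """Difference between the teaching input and the estimate of the link followed."""
    return teaching_input(doc_relevance, link_estimates, mu) - followed_link_estimate


# Hypothetical values: the agent predicted 0.7 for the link it followed, then
# found a document with relevance 0.6 whose best outgoing link is estimated at 0.8.
target = teaching_input(0.6, [0.2, 0.8])
error = prediction_error(0.7, 0.6, [0.2, 0.8])
```

A positive error means the followed link led somewhere better than predicted, so the link estimator would be nudged upward for links with similar features.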

Performance
[Figure: average target recall vs. pages crawled]
ACM Trans. Internet Technology 2003

http://sixearch.org


6S: Collaborative Peer Search
[Figure: peer architecture (WWW, crawler, index, bookmarks, local storage) and peers A, B, C exchanging queries and hits]
Data mining & referral opportunities
Emerging communities
WWW2004, WWW2005, WTAS2005, P2PIR2006

Reinforcement Learning

Query Routing

Simulating 500 Users ODP (dmoz.org) Maguitman, Menczer et al.: Algorithmic computation and approximation of semantic similarity. WWW2005, WWWJ 2006


P@10

Distributed vs Centralized
[Figure: precision vs. recall for 6S, a centralized search engine, and Google]

Small-world
[Figure: average change (%) in cluster coefficient and diameter vs. queries per peer]

Semantic Similarity
[Figure: ODP topics grouped by semantic similarity]
Business/E-Commerce/Developers
Business/Telecommunications/Call_Centers
Computers/Programming/Graphics
Shopping/Clothing/Accessories
Shopping/Sports/Cycling
Sports/Cycling/Racing
Business/Arts_and_Entertainment/Fashion
Shopping/Clothing/Footwear
Shopping/Clothing/Uniforms
Arts/Movies/Filmmaking
Shopping/Visual_Arts/Artist_Created_Prints
Health/Professions/Midwifery
Health/Reproductive_Health/Birth_Control
Home/Family/Pregnancy
Society/Issues/Abortion
Health/Conditions_and_Diseases/Cancer
Health/Mental_Health/Grief,_Loss_and_Bereavement
Society/People/Women

Ongoing Work
Improve coverage/diversity in query routing algorithm
Spam protection: trust/reputation subsystem
User study with 6S application

User study

Query network

Result network

http://sixearch.org Questions?

Thank you! Questions? informatics.indiana.edu/fil Research supported by NSF CAREER Award IIS-0348940