Chapter 6: Advanced Crawling Techniques

Outline
- Selective Crawling
- Focused Crawling
- Distributed Crawling
- Web Dynamics

Web Crawler
- A program that autonomously navigates the web and downloads documents.
- A simple crawler (see the sketch below):
  - start with a seed URL, $S_0$
  - download all pages reachable from $S_0$
  - repeat the process for each new page until a sufficient number of pages has been retrieved
- An ideal crawler would:
  - recognize relevant pages
  - limit fetching to the most relevant pages
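A minimal sketch of such a simple crawler, using only the Python standard library. The breadth-first frontier, the page cap (`max_pages`), and the `LinkExtractor` helper are illustrative assumptions, not anything the slides prescribe:

```python
# Minimal breadth-first crawler sketch (illustrative assumptions throughout).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href targets of <a> tags from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def simple_crawl(seed_url, max_pages=100):
    frontier = deque([seed_url])   # URLs waiting to be fetched (FIFO = BFS)
    seen = {seed_url}              # avoid fetching the same URL twice
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue               # skip unreachable or malformed pages
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages
```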

Nature of the Crawl
- Crawls are broadly categorized into:
  - Exhaustive crawl: broad coverage; used by general-purpose search engines.
  - Selective crawl: fetch pages according to some criterion, e.g., popular pages or similar pages; exploits semantic content and rich contextual aspects.

Selective Crawling
- Retrieve web pages according to some criterion.
- Page relevance is determined by a scoring function $s_\theta^{(\xi)}(u)$, where $\xi$ is the relevance criterion and $\theta$ its parameters.
- For example, a boolean relevance function:
  - $s(u) = 1$: the document is relevant
  - $s(u) = 0$: the document is irrelevant

Selective Crawler
- Basic approach (see the sketch below):
  - sort the fetched URLs according to their relevance score
  - use best-first search to obtain pages with a high score first
  - the search then leads to the most relevant pages
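A hedged sketch of the best-first selective crawler: the frontier becomes a priority queue keyed by the relevance score. The `score`, `fetch`, and `extract_links` callables are assumed placeholders for whatever scoring function $s_\theta(u)$ and download machinery are in use:

```python
# Best-first selective crawling sketch: highest-scoring URLs are fetched first.
import heapq

def best_first_crawl(seeds, score, fetch, extract_links, max_pages=100):
    frontier = [(-score(u), u) for u in seeds]  # max-heap via negated scores
    heapq.heapify(frontier)
    seen = {u for _, u in frontier}
    pages = {}
    while frontier and len(pages) < max_pages:
        _, url = heapq.heappop(frontier)        # highest-scoring URL first
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return pages
```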

Examples of Scoring Functions
- Depth: the length of the path from the site homepage to the document.
  - limits the total number of levels retrieved from a site
  - maximizes coverage breadth

$$s_\delta^{(\mathrm{depth})}(u) = \begin{cases} 1, & \text{if } |\mathrm{root}(u) \rightsquigarrow u| < \delta \\ 0, & \text{otherwise} \end{cases}$$

- Popularity: assign relevance according to which pages are more important than others, e.g., by estimating the number of backlinks (both boolean scores are sketched in code below).

$$s_\tau^{(\mathrm{backlinks})}(u) = \begin{cases} 1, & \text{if } \mathrm{indegree}(u) > \tau \\ 0, & \text{otherwise} \end{cases}$$

- PageRank: assign a value of importance, proportional to the popularity of the source document, estimated by a measure of the indegree of a page.

[Figure: efficiency of selective crawlers, comparing a BFS crawler, a crawler using backlinks, and a crawler using PageRank.]
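Hedged sketches of the two boolean scoring functions above. Using the URL path depth as a stand-in for the path length from the site homepage is an assumption, and `delta`, `tau`, and the `indegree` mapping are illustrative parameters:

```python
# Boolean depth and backlink scores, as sketched (not authoritative) functions.
from urllib.parse import urlparse

def depth_score(url, delta=3):
    # 1 if the URL sits fewer than delta levels below the site root.
    path = urlparse(url).path
    depth = len([segment for segment in path.split("/") if segment])
    return 1 if depth < delta else 0

def backlink_score(url, indegree, tau=10):
    # indegree: mapping url -> estimated number of backlinks.
    return 1 if indegree.get(url, 0) > tau else 0
```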

Focused Crawling
- Fetch pages within a certain topic.
- Relevance function: use text categorization techniques, $s_\theta^{(\mathrm{topic})}(u) = P(c \mid d(u), \theta)$.
- Parent-based method: the score of a parent is extended to the URLs of its children.
- Anchor-based method: anchor text is used for scoring pages.

Focused Crawler
- Basic approach (see the sketch below):
  - classify crawled pages into categories
  - use a topic taxonomy, provide example URLs, and mark categories of interest
  - use a Bayesian classifier to find $P(c \mid p)$
  - compute a relevance score for each page: $R(p) = \sum_{c \in \mathrm{good}} P(c \mid p)$
- Soft focusing:
  - compute the score of a fetched document and extend it to all the URLs it contains: $s_\theta^{(\mathrm{topic})}(u) = P(c \mid d(v), \theta)$, where $v$ is a parent of $u$
  - if the same URL is reached from multiple parents, update $s(u)$
- Hard focusing:
  - for a crawled page $d$, find the leaf node $c^*$ with the highest probability
  - if some ancestor of $c^*$ is marked good, extract the URLs from $d$; otherwise the crawl is pruned at $d$
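A small sketch of the relevance score $R(p) = \sum_{c \in \mathrm{good}} P(c \mid p)$ using a Naive Bayes text classifier. scikit-learn, the toy training data, and the `GOOD` category set are assumptions for illustration only:

```python
# Relevance scoring for a focused crawler with a Naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy taxonomy examples (assumed data, standing in for marked example URLs).
train_docs = ["python web crawler tutorial", "football match results",
              "search engine indexing", "movie review drama"]
train_labels = ["tech", "sports", "tech", "entertainment"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_docs)
classifier = MultinomialNB().fit(X, train_labels)

GOOD = {"tech"}  # categories of interest marked in the taxonomy

def relevance(page_text):
    # R(p): sum of P(c | p) over the categories marked as good.
    probs = classifier.predict_proba(vectorizer.transform([page_text]))[0]
    return sum(p for c, p in zip(classifier.classes_, probs) if c in GOOD)
```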

[Figure: efficiency of a focused crawler.]

Context-Focused Crawlers
- Classifiers are trained to estimate the link distance between a crawled page and the relevant pages.
- A context graph of $L$ layers is built for each seed page.

Context Graphs
- The seed page forms layer 0.
- Layer $i$ contains all the parents of the nodes in layer $i-1$.

Context Graphs
- To compute the relevance function:
  - a set of Naive Bayes classifiers is built, one for each layer
  - compute $P(t \mid c_l)$ from the pages in each layer
  - compute $P(c_l \mid p)$; the class with the highest probability is assigned to the page
  - if the winning probability $P(c_l \mid p) < \tau$, the page is assigned to the "other" class
- Maintain a queue for each layer, sorted by the probability scores $P(c_l \mid p)$.
- For the next URL, the crawler picks the top page from the non-empty queue with the smallest $l$ (see the sketch below):
  - this yields the pages that are closer to the relevant pages first
  - the outlinks of such pages are explored first

Reinforcement Learning
- Learning which actions yield maximum reward.
- To maximize rewards, the learning agent:
  - uses previously tried actions that produced effective rewards
  - explores better action selections in the future
- Properties: trial-and-error method; delayed rewards.
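A sketch of the layered frontier described above: one priority queue per layer, always serving the non-empty queue with the smallest layer index $l$, so that pages estimated to be closest to relevant pages are expanded first. The class and method names are invented for illustration:

```python
# Layered frontier for a context-focused crawler (illustrative sketch).
import heapq

class LayeredFrontier:
    def __init__(self, num_layers):
        self.queues = [[] for _ in range(num_layers)]
    def push(self, layer, prob, url):
        # Each layer's queue is ordered by P(c_l | p), highest first.
        heapq.heappush(self.queues[layer], (-prob, url))
    def pop(self):
        # Serve the non-empty queue with the smallest layer index l.
        for queue in self.queues:
            if queue:
                neg_prob, url = heapq.heappop(queue)
                return url, -neg_prob
        return None  # all queues empty
```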

Elements of Reinforcement Learning
- Policy $\pi(s, a)$: the probability of taking action $a$ in state $s$.
- Reward function $r(s, a)$: maps state-action pairs to a single number, indicating the immediate desirability of the state.
- Value function $V^\pi(s)$: indicates the long-term desirability of states, taking into account the states that are likely to follow and the rewards available in those states.

Reinforcement Learning
- The optimal policy $\pi^*$ maximizes the value function over all states:

$$V^*(s) = \max_{a \in A(s)} E\left[ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \right]$$

- LASER uses reinforcement learning for the indexing of web pages:
  - for a user query, determine relevance using TFIDF
  - propagate rewards into the web graph, discounting them at each step, by value iteration (see the sketch below)
  - after convergence, documents at distance $k$ from $u$ contribute $\gamma^k$ times their relevance to the relevance of $u$

Fish Search
- Web agents behave like fish in the sea:
  - they gain energy when a relevant document is found, and search for more relevant documents
  - they lose energy when exploring irrelevant pages
- Limitations:
  - assigns discrete relevance scores: 1 for relevant, 0 or 0.5 for irrelevant
  - low discrimination of the priority of pages
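A toy value-iteration sketch of LASER-style reward propagation. The graph and relevance values are made up, and taking the max over outlinks (rather than, say, a sum) is an assumption; the point is only that after convergence a page at distance $k$ from a relevant page receives a $\gamma^k$-discounted contribution:

```python
# Value iteration over a tiny link graph (toy data, assumed update rule).
GAMMA = 0.5

def propagate(relevance, outlinks, iterations=50):
    value = dict(relevance)
    for _ in range(iterations):
        value = {
            u: relevance[u]
               + GAMMA * max((value[v] for v in outlinks.get(u, [])), default=0.0)
            for u in relevance
        }
    return value

relevance = {"a": 0.0, "b": 0.0, "c": 1.0}   # only c is relevant to the query
outlinks = {"a": ["b"], "b": ["c"]}          # a -> b -> c
print(propagate(relevance, outlinks))
# a is two links from c and ends up with gamma^2 * 1.0 = 0.25, as predicted.
```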

Shark Search Algorithm
- Introduces real-valued relevance scores, based on:
  - the ancestral relevance score
  - anchor text
  - the textual context of the link

Distributed Crawling
- A single crawling process is insufficient for large-scale engines: all data would be fetched through a single physical link.
- Distributed crawling:
  - a scalable, divide-and-conquer system
  - decreases hardware requirements
  - increases overall download speed and reliability

Parallelization
- Physical links reflect geographical neighborhoods, while edges of the Web graph are associated with communities that cross geographical borders.
- Hence, significant overlap among the collections fetched by geographically distributed crawlers is to be expected.
- Performance measures of parallelization: communication overhead, overlap, coverage, quality.

Performance of Parallelization
- Communication overhead: the fraction of bandwidth spent coordinating the activity of the separate processes, with respect to the bandwidth usefully spent fetching documents.
- Overlap: the fraction of duplicate documents.
- Coverage: the fraction of documents reachable from the seeds that are actually downloaded.
- Quality: e.g., some scoring functions depend on link structure, which can be partially lost when the crawl is split up.

Crawler Interaction
- A study by Cho and Garcia-Molina (2002) defined a framework to characterize the interaction among a set of crawlers, along several dimensions: coordination, confinement, and partitioning.

Coordination
- The way different processes agree about the subset of pages each should crawl.
- Independent processes:
  - the degree of overlap is controlled only by the seeds
  - significant overlap is expected; picking good seed sets is a challenge
- Coordinated pool of crawlers: partition the Web into subgraphs.
  - static coordination: the partition is decided before crawling and not changed thereafter
  - dynamic coordination: the partition is modified during crawling (the reassignment policy must be controlled by an external supervisor)

Confinement
- Specifies how strictly each (statically coordinated) crawler should operate within its own partition.
- Firewall mode: each process remains strictly within its partition. Zero overlap, but poor coverage.
- Crossover mode: a process follows interpartition links when its queue contains no more URLs in its own partition. Good coverage, but potentially high overlap.
- Exchange mode: a process never follows interpartition links, but periodically dispatches the foreign URLs to the appropriate processes. No overlap and perfect coverage, at the cost of some communication overhead.

Crawler Coordination
- Let $A_{ij}$ be the set of documents belonging to partition $i$ that can be reached from the seeds $S_j$.

Partitioning
- A strategy to split URLs into nonoverlapping subsets to be assigned to each process:
  - compute a hash function of the IP address in the URL: if $n \in \{0, \ldots, 2^{32} - 1\}$ corresponds to the IP address and $m$ is the number of processes, documents with $n \bmod m = i$ are assigned to process $i$ (see the sketch below)
  - take into account the geographical dislocation of networks
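A minimal sketch of the IP-hash partitioning rule above. Resolving the hostname at assignment time and the `assign_partition` helper name are assumptions:

```python
# Assign a URL to one of m crawler processes by hashing its host's IP address.
import socket
import struct
from urllib.parse import urlparse

def assign_partition(url, m):
    host = urlparse(url).hostname
    ip = socket.gethostbyname(host)                    # dotted-quad string
    n = struct.unpack("!I", socket.inet_aton(ip))[0]   # n in {0, ..., 2^32 - 1}
    return n % m                                       # process index i

# With m = 4 processes, every URL on the same host lands in the same
# partition, so interpartition links can be detected without communication.
```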

Web Dynamics
- The rate of change of information on the Web; used by search engines for updating their index.
- Notion of recency proposed by Brewington and Cybenko (2000):
  - The index entry for a document indexed at time $t_0$ is $\beta$-current at time $t$ if the document has not changed during $[t_0, t - \beta]$; $\beta$ is a grace period.
  - A search engine for a given collection is $(\alpha, \beta)$-current if the probability that a document is $\beta$-current is at least $\alpha$.

Lifetime of Documents
- $T$, the lifetime of a component, is the time when the component breaks down; a continuous random variable.
- The cumulative distribution function (cdf) of the lifetime: $F(t) = P(T \le t)$.
- Reliability (survivorship function): the probability that the component is still functioning at time $t$: $S(t) = 1 - F(t) = P(T > t)$.

Aging of Components
- The age of a component is the time elapsed since its last replacement.
- The cdf of age, i.e., the expected fraction of components that are still operating at time $t$:

$$G(t) = P(\mathrm{age} < t) = \frac{\int_0^t S(\tau)\, d\tau}{\int_0^\infty S(\tau)\, d\tau}$$

- The age probability density function (pdf) $g(t)$ is therefore proportional to the survivorship function.

[Figure: lifetime vs. age.]

Aging of Documents
- $S(t)$ is the probability that a document last changed at time zero remains unmodified at time $t$.
- $G(t)$, as above, gives the expected fraction of documents whose age is at most $t$.
- The probability that a document will be modified before an additional time $h$ has passed is the conditional probability $P(t < T \le t + h \mid T > t)$.
- The change rate:

$$\lambda(t) = \lim_{h \to 0} \frac{1}{h} P(t < T \le t + h \mid T > t) = \frac{f(t)}{S(t)}$$

where $f(t)$ is the lifetime pdf.

Age and Lifetime on the Web
- Assuming a constant change rate $\lambda$, the lifetime can be estimated from the observed age:

$$F(t) = 1 - e^{-\lambda t}, \qquad g(t) = f(t) = \lambda e^{-\lambda t}$$

- Use the Last-Modified timestamp in the HTTP header (see the estimation sketch below).
- Brewington and Cybenko (2000): 7 million web pages observed between 1999 and 2000.
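A sketch of estimating the change rate from observed ages under the constant-rate model: since ages are then exponentially distributed with density $g(t) = \lambda e^{-\lambda t}$, the maximum-likelihood estimate of $\lambda$ is the reciprocal of the mean observed age. The sample ages are made up; Last-Modified parsing is shown for completeness:

```python
# MLE of the change rate lambda from observed document ages (toy sample data).
import time
from email.utils import parsedate_to_datetime

def age_in_days(last_modified_header, now=None):
    # Age derived from the HTTP Last-Modified header.
    modified = parsedate_to_datetime(last_modified_header).timestamp()
    now = now if now is not None else time.time()
    return (now - modified) / 86400.0

ages = [3.0, 10.5, 27.0, 1.2, 8.8]       # observed ages in days (made up)
lam = 1.0 / (sum(ages) / len(ages))      # MLE: 1 / mean observed age
print(f"estimated change rate: {lam:.3f} changes/day,"
      f" mean lifetime: {1 / lam:.1f} days")
```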

[Figure: empirical distributions of observed lifetimes and ages.]

Growth with Time
- An essential property, that the Web is growing with time, is not captured by the model: most documents are young.
- Suppose an exponential distribution for the growth: if documents were never edited, their age would simply be the time since their creation, and this trivial growth model already yields an exponential age distribution.
- A realistic model should take into account both Web growth and document refreshing: use a hybrid model.

Meaningful Timestamps
- Servers often do not return meaningful timestamps, especially for dynamic web pages.
- Thus, change rates are better estimated from lifetimes rather than from ages.
- Brewington and Cybenko (2000) propose an improved model that explicitly takes into account the probability of observing a change, given the change rate and the timespan.

Improved Model
- Document changes are controlled by an underlying Poisson process: the probability of observing a change is independent of previous changes.
- Mean lifetimes are Weibull distributed; change rates and timespans are independent.
- The resulting lifetime distribution:

$$f(t) = \int_0^\infty \lambda e^{-\lambda t}\, \hat{w}(1/\lambda)\, d(1/\lambda)$$

where $\hat{w}(1/\lambda)$ is the estimated density of the mean lifetime $1/\lambda$.

[Figure: estimated mean lifetime distribution.]

Refresh Interval
- Estimate how often a crawler should refresh the index of a search engine to guarantee that it remains $(\alpha, \beta)$-current.
- Consider a single document; let $t = 0$ be the time the document was last fetched and $I$ the interval between two consecutive visits.

Refresh Interval
- The probability that at a particular time $t$ the document is unmodified in $[0, t - \beta]$ is

$$\begin{cases} 1, & t \in (0, \beta) \\ e^{-\lambda (t - \beta)}, & t \in [\beta, I) \end{cases}$$

- The probability that a collection of documents is $\beta$-current is then

$$\alpha = \int_0^\infty w(t) \left[ \frac{\beta}{I} + \frac{t}{I}\left(1 - e^{-(I - \beta)/t}\right) \right] dt$$

- Example (Brewington and Cybenko 2000): a reindexing period of about 18 days, assuming a Web size of 800 million pages, guarantees that 95% of the repository is current up to one week ago.

Freshness
- Another measure of recency: the freshness $\phi(t)$ of a document at time $t$ is a binary function indicating whether the document is up-to-date in the index at time $t$.
- Expected freshness: the probability that the document did not change in $(0, t]$, $E[\phi(t)] = e^{-\lambda t}$; this corresponds to $\beta$-currency for $\beta = 0$.
- If document $d$ is refreshed regularly every $I$ time units, the average freshness is (checked numerically below)

$$\bar{\phi} = \frac{1 - e^{-\lambda I}}{\lambda I}$$
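A quick numeric check of the average-freshness formula: the closed form $(1 - e^{-\lambda I})/(\lambda I)$ should match a direct numerical average of $E[\phi(t)] = e^{-\lambda t}$ over one refresh interval. The rate and interval values are arbitrary:

```python
# Compare the closed-form average freshness against numerical integration.
import math

def avg_freshness(lam, I, steps=100_000):
    closed = (1 - math.exp(-lam * I)) / (lam * I)
    # Midpoint-rule average of e^{-lambda t} over [0, I].
    dt = I / steps
    numeric = sum(math.exp(-lam * (k + 0.5) * dt) for k in range(steps)) * dt / I
    return closed, numeric

print(avg_freshness(lam=0.1, I=7.0))   # rate per day, 7-day refresh interval
# Both values agree, e.g. ~0.7192 for these parameters.
```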

Index Age
- The index age $a(t)$ of a document is the age of the document if the local copy is outdated, or zero if the local copy is fresh. E.g., if the document was modified at time $s \in (0, t)$, its index age is $t - s$.
- Expected index age at time $t$:

$$E[a(t)] = \int_0^t (t - s)\, \lambda e^{-\lambda s}\, ds = t - \frac{1 - e^{-\lambda t}}{\lambda}$$

- Index age averaged over $(0, I)$:

$$\bar{a} = \frac{I}{2} - \frac{1}{\lambda} + \frac{1 - e^{-\lambda I}}{\lambda^2 I}$$

Recency and Synchronization Policies
- Not all sites change their pages at the same rate.
- Cho and Garcia-Molina (1999) monitored about 720,000 popular web pages:
  - a vast fraction of this Web is dynamic
  - dramatically different results across top-level domains

[Figure: average change interval found in the study. 270 popular sites were monitored for changes from 17 February to 24 June 1999; a sample of 3,000 pages was collected from each site by visiting in breadth-first order from the homepage.]

Optimal Synchronization Policy
- The resource allocation policy should take into account the dynamics of each specific site.
- Simplifying assumptions:
  - $N$ documents of interest, with estimated change rates $\lambda_i$, $i = 1, \ldots, N$
  - a crawler regularly fetches each document $i$ with refresh interval $I_i$
  - each document is fetched in constant time
  - $B$ is the available bandwidth: the number of documents that can be fetched in a time unit

Optimal Resource Allocation
- Limited bandwidth constraint:

$$\sum_{i=1}^{N} \frac{1}{I_i} \le B$$

- The problem of optimal resource allocation: select the refresh intervals $I_i$ that maximize a recency measure of the resulting index, e.g., freshness (a numeric sketch of this optimization follows the example below):

$$(I_1^*, \ldots, I_N^*) = \arg\max_{I_1, \ldots, I_N} \frac{1}{N} \sum_{i=1}^{N} \bar{\phi}(\lambda_i, I_i) \quad \text{subject to} \quad \sum_{i=1}^{N} \frac{1}{I_i} \le B$$

Example: WebFountain (Edwards et al. 2001)
- A fully distributed and incremental crawler with no central control.
- The repository entry of a document is updated as soon as it is fetched, and the crawling process is never regarded as complete.
- Changes are detected when a document is re-fetched; documents are grouped by similar rates of change.
- The trade-off between re-fetching (freshness) and exploring (coverage) is controlled by optimizing the number of old and new URLs.
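A hedged numeric sketch of the constrained optimization above, using scipy (an assumption; the slides do not prescribe a solver). The change rates and bandwidth are toy values. Consistent with Cho and Garcia-Molina's analysis, the solution tends to assign long refresh intervals to documents that change much faster than the crawler can afford to track:

```python
# Maximize average freshness subject to sum(1/I_i) <= B (toy parameters).
import numpy as np
from scipy.optimize import minimize

lam = np.array([0.01, 0.1, 1.0, 10.0])   # estimated change rates (per day)
B = 2.0                                   # fetches per day the crawler can afford

def neg_avg_freshness(f):
    # f_i = 1/I_i is the fetch frequency of document i.
    I = 1.0 / f
    return -np.mean((1 - np.exp(-lam * I)) / (lam * I))

result = minimize(
    neg_avg_freshness,
    x0=np.full(len(lam), B / len(lam)),   # start by splitting bandwidth evenly
    bounds=[(1e-6, None)] * len(lam),
    constraints=[{"type": "ineq", "fun": lambda f: B - f.sum()}],
)
print("optimal refresh intervals (days):", 1.0 / result.x)
```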