How to Crawl the Web. Hector Garcia-Molina Stanford University. Joint work with Junghoo Cho

Size: px

Start display at page:

Download "How to Crawl the Web. Hector Garcia-Molina Stanford University. Joint work with Junghoo Cho"

Meryl Gallagher
6 years ago
Views:

1 How to Crawl the Web Hector Garcia-Molina Stanford University Joint work with Junghoo Cho

2 Stanford InterLib Technologies Information Overload Service Heterogeneity Interoperability Economic Concerns Information Loss Value Filtering Physical Barriers Mobile Access IP Infrastructure Archival Repository

3 WebBase Architecture Client Client Webbase API WWW Retrieval Indexes Feature Repository Web Crawler Web Crawler Web Crawler Web Crawlers Repository Multicast Engine Client Client Client Client

4 What is a Crawler? init initial urls web get next url get page to visit urls visited urls extract urls web pages 4

5 Crawling at Stanford WebBase Project BackRub Search Engine, Page Rank Google New WebBase Crawler 5

6 Crawling Issues (1) Load at visited web sites robots.txt files space out requests to a site limit number of requests to a site per day limit depth of crawl multicast 6

7 Crawling Issues (2) Load at crawler parallelize? initial urls init get next url get page extract urls to visit urls visited urls init get next url get page extract urls web pages 7

8 Crawling Issues (3) Scope of crawl not enough space for all pages not enough time to visit all pages Solution: Visit important pages microsoft microsoft visited pages 8

9 Crawling Issues (4) Incremental crawling how do we keep pages fresh? how do we avoid crawling from scratch? Focus of this talk... 9

10 Web Evolution Experiment How often does a web page change? What is the lifespan of a page? How long does it take for 50% of the web to change? How do we model web changes? 10

11 Experimental Setup February 17 to June 24, sites visited (with permission) identified 400 sites with highest page rank contacted administrators 720,000 pages collected 3,000 pages from each site daily start at root, visit breadth first (get new & old pages) ran only 9pm - 6am, 10 seconds between site requests 11

12 How Often Does a Page Change? Example: 50 visits to page, 5 changes average change interval = 50/5 = 10 days Is this correct? changes 1 day page visited 12

13 Average Change Interval 13 fraction of pages

14 Average Change Interval By Domain fraction of pages 14

15 How Long Does a Page Live? page lifetime page lifetime experiment duration experiment duration page lifetime page lifetime experiment duration experiment duration 15

16 Page Lifespans 16 fraction of pages

17 Page Lifespans fraction of pages Method 1 used 17

18 Time for a 50% Change days 18 fraction of unchanged pages

19 Modeling Web Evolution Poisson process with rate λ T is time to next event f T (t) = λe -λt (t > 0) 19

20 Change Interval of Pages fraction of changes with given interval for pages that change every 10 days on average Poisson model interval in days 20

21 Freshness Change Metrics Freshness of element e i at time t is F( e i ; t ) = 1 if e i is up-to-date at time t 0 otherwise web e i database e i Freshness of the database S at time t is N F( S ; t ) = Σ F( e i ; t ) N i=1 21

22 Change Metrics Age Age of element e i at time t is web e i... database e i... A( e i ; t ) = 0 if e i is up-to-date at time t t - (modification e i time) otherwise Age of the database S at time t is 1 N A( S ; t ) = Σ A( e i ; t ) N i=1 22

23 Change Metrics F(e i ) 1 0 A(e i ) 0 time time Time averages: F( e i ) = lim F(e i ; t ) dt t t 1 t F( S ) = lim F(S ; t ) dt t t 1 t 0 0 update refresh similar for age... 23

24 Crawler Types In-place vs. shadow web database shadow database e i e i e i Steady vs. batch crawler on crawler off time 24

25 Comparison: Batch vs. Steady batch mode in-place crawler crawler running steady in-place crawler 25

26 Shadowing Steady Crawler without shadowing 26 current collection crawler s collection

27 Shadowing Batch Crawler without shadowing 27 current collection crawler s collection

28 Experimental Data Freshness Steady Batch In-Place Shadowing Pages change on average every 4 months Batch crawler works one week out of

29 Fixed order Refresh Order Example: Explicit list of URLs to visit Random Order Example: Start from seed URLs & follow links Purely Random Example: Refresh pages on demand, as requested by user database web e i e i

30 Freshness vs. Order steady, in-place crawler, pages change at same average rate r = λ / f = average change frequency / average visit frequency 30

31 Age vs. Order = Age / time to refresh all N elements r = λ / f = average change frequency / average visit frequency 31

32 Non-Uniform Change Rates fraction of elements g( λ ) λ 32

33 Trick Question Two page database e 1 changes daily e 2 changes once a week Can visit pages once a week How should we visit pages? e 1 e 1 e 1 e 1 e 1 e 1... e 1 e 2 database e 1 e 2 web e 2 e 2 e 2 e 2 e 2 e 2... e 1 e 2 e 1 e 2 e 1 e 2... [uniform] e 1 e 1 e 1 e 1 e 1 e 1 e 2 e 1 e 1... [proportional]? 33

34 Proportional Often Not Good! Visit fast changing e 1 get 1/2 day of freshness Visit slow changing e 2 get 1/2 week of freshness Visiting e 2 is a better deal! 34

35 Selecting Optimal Refresh Frequency Analysis is complex Shape of curve is the same in all cases Holds for any distribution g( λ ) 35

36 Optimal Refresh Frequency for Age Analysis is also complex Shape of curve is the same in all cases Holds for any distribution g( λ ) 36

37 Comparing Policies Freshness Age Proportional days Uniform days Optimal days In-place, steady crawler; Based on our experimental data [Pages change at different frequencies, as measured in experiment.] 37

38 Building an Incremental Crawler Steady In-place Variable visit frequencies Fixed order improves freshness! Also need: Page change frequency estimator Page replacement policy Page ranking policy (what are important pages?) See papers... 38

39 The End Thank you for your attention... For more information, visit Follow Searchable Publications link, and search for author = Cho 39

Effective Page Refresh Policies for Web Crawlers

For CS561 Web Data Management Spring 2013 University of Crete Effective Page Refresh Policies for Web Crawlers and a Semantic Web Document Ranking Model Roger-Alekos Berkley IMSE 2012/2014 Paper 1: Main