Discussion 3: crawler4j

1 Discussion 3: crawler4j Jan 22nd, 2014 Content adapted from http://code.google.com/p/crawler4j/ INF 141 / CS 121 Tao Wang 1

2 Recall: a robust crawl architecture * Adapted from Jan 21 lecture INF 141 / CS 121 Tao Wang 2

3 Crawler4j makes it easy
Configure your crawler: local storage folder, number of crawlers, max depth/pages, politeness, user agent string, proxy, resumable crawling
Add seeds
Start crawling:
  while (fetch next URL from frontier)
    if (this page should be visited)
      extract data from this page
      process data
      extract outgoing links from this page
      add links to frontier
INF 141 / CS 121 Tao Wang 3
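The crawl loop on this slide can be written out as a plain-Java sketch. This is illustrative only: crawler4j runs this loop internally, and shouldVisit, fetch, process, and extractLinks below are hypothetical stubs standing in for the real work.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class CrawlLoopSketch {
    // Hypothetical stubs -- stand-ins for real fetching, parsing, and filtering logic
    static boolean shouldVisit(String url) { return true; }
    static String fetch(String url) { return ""; }
    static void process(String url, String page) { }
    static List<String> extractLinks(String page) { return List.of(); }

    public static void main(String[] args) {
        Queue<String> frontier = new ArrayDeque<>(List.of("http://example.com/"));
        Set<String> seen = new HashSet<>(frontier);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();                 // fetch next URL from the frontier
            if (!shouldVisit(url)) continue;              // if this page should be visited
            String page = fetch(url);
            process(url, page);                           // extract data from the page and process it
            for (String link : extractLinks(page)) {      // extract outgoing links
                if (seen.add(link)) frontier.add(link);   // add unseen links to the frontier
            }
        }
    }
}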

5 Implement a crawler Extend the WebCrawler class and override two methods: boolean shouldVisit(WebURL url), which determines whether a given URL should be crawled (based on your own logic), and void visit(Page page), which is where your processing happens (build an index, record page statistics). Outgoing links are added to the frontier by crawler4j. INF 141 / CS 121 Tao Wang 5
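A minimal sketch of such a crawler, following the method signatures on this slide (the circa-2014 crawler4j API). The class name, the ics.uci.edu prefix check, and the file-extension filter are only examples of "your own logic".

import java.util.regex.Pattern;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {
    // Skip style sheets, scripts, images, and other binary files
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|pdf|zip))$");

    @Override
    public boolean shouldVisit(WebURL url) {
        // Your own logic: here, stay inside one site and skip filtered extensions
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.ics.uci.edu/");
    }

    @Override
    public void visit(Page page) {
        // Called once per downloaded page: build your index, record statistics, etc.
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            System.out.println(page.getWebURL().getURL()
                    + " : " + html.getText().length() + " characters of text");
        }
        // Outgoing links are extracted and added to the frontier by crawler4j itself
    }
}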

6 create a CrawlController A local folder for intermediate crawl data INF 141 / CS 121 Tao Wang 6

7 create a CrawlController Number of concurrent crawling threads INF 141 / CS 121 Tao Wang 7

8 create a CrawlController Nothing needs to be changed here INF 141 / CS 121 Tao Wang 8

9 create a CrawlController Here are your URL seeds INF 141 / CS 121 Tao Wang 9
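Slides 6 through 9 correspond to controller code along these lines. This is a sketch modelled on the crawler4j examples of that era; the storage folder, thread count, and seed URL are placeholders, and MyCrawler is the WebCrawler subclass from slide 5.

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/tmp/crawl";   // local folder for intermediate crawl data
        int numberOfCrawlers = 7;                   // number of concurrent crawling threads

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        // Boilerplate that normally does not need to be changed
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // URL seeds: the crawl starts from these pages
        controller.addSeed("http://www.ics.uci.edu/");

        // Start the crawl with the WebCrawler subclass from slide 5; blocks until finished
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}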

10 other configurations Maximum crawl depth: default is -1 for unlimited depth. A -> B -> C -> D: A has depth 0. Max depth = 2 means D won't be crawled. INF 141 / CS 121 Tao Wang 10
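A sketch of setting this on the config object from the controller sketch above, assuming CrawlConfig's setMaxDepthOfCrawling setter from the crawler4j versions of that time:

// A has depth 0; with a maximum depth of 2, page D in the chain above is not crawled
config.setMaxDepthOfCrawling(2);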

11 other configurations Maximum number of pages to crawl: default is no limit INF 141 / CS 121 Tao Wang 11
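A sketch, assuming CrawlConfig's setMaxPagesToFetch setter; 1000 is just an example limit:

// Stop the crawl after roughly this many pages have been fetched
config.setMaxPagesToFetch(1000);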

12 other configurations Politeness INF 141 / CS 121 Tao Wang 12
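In crawler4j, politeness is mainly the delay between consecutive requests to the same host. A sketch, assuming CrawlConfig's setPolitenessDelay setter; the 1000 ms value is just an example:

// Wait at least 1000 ms between requests sent to the same host
config.setPolitenessDelay(1000);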

13 other configurations User agent string: used for identifying your crawler to web servers. Default is "crawler4j (http://code.google.com/p/crawler4j/)". To change it: crawlConfig.setUserAgentString(userAgentString); INF 141 / CS 121 Tao Wang 13

14 other configurations Proxy: if you need to use a proxy: config.setProxyHost("proxyserver.example.com"); config.setProxyPort(8080); If your proxy also needs authentication: config.setProxyUsername(username); config.setProxyPassword(password); INF 141 / CS 121 Tao Wang 14

15 other configurations Resumable crawling: if your crawler will run for a long time, unexpected termination is possible. To resume from a previously stopped or crashed crawl: crawlConfig.setResumableCrawling(true); INF 141 / CS 121 Tao Wang 15

16 Other issues robots.txt: robotstxtServer.allows(webURL) checks whether a URL is allowed to be crawled. Details of how crawler4j finds robots.txt are in RobotstxtServer.fetchDirectives(URL url). Duplicate URLs: WebCrawler.processPage(WebURL curURL) relies on a DocID; details are in the DocIDServer class. INF 141 / CS 121 Tao Wang 16
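A sketch of calling the robots.txt check by hand, reusing the robotstxtServer instance from the controller sketch above. crawler4j already performs this check internally before fetching; the URL here is a placeholder, and the allows signature is assumed from the circa-2014 API.

WebURL webURL = new WebURL();
webURL.setURL("http://www.ics.uci.edu/somepage.html");
if (robotstxtServer.allows(webURL)) {
    // robots.txt for this host permits fetching the URL
    System.out.println("allowed: " + webURL.getURL());
}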

17 learn more about crawler4j http://code.google.com/p/crawler4j/ All content in this presentation is adapted from this site. Documentation on the site is limited. Source code is available from the git repository: https://crawler4j.googlecode.com/git/. Download samples: https://crawler4j.googlecode.com/archive/ e14a eaba b2ce779e0d99bbf.zip (the crawler4j source code is also included in the sample package). INF 141 / CS 121 Tao Wang 17

18 Discussion 3: crawler4j Jan 22nd, 2014 Content adapted from http://code.google.com/p/crawler4j/ INF 141 / CS 121 Tao Wang 18
