COMP 4601 Web Crawling
What is Web Crawling?
- The process by which an agent traverses the links on a web page
- For each page visited, generally store the content
- Start from a root page (or several)
Motivation for Crawling
- Want to create a view of the World Wide Web
- Interested in a graph representing linked pages
- The graph provides:
  - Ability to measure distance
  - Ability to measure node importance
- Node content (pages) can be indexed
Motivation for Web Crawling
- Want to perform information extraction
- Information Extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity concerns processing human-language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and content extraction from images/audio/video, can also be seen as information extraction.
What is a Web Crawler?
According to Wikipedia (http://en.wikipedia.org/wiki/Web_crawler):
- A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.
- A crawler can potentially crawl any graph; e.g., the Facebook social network.
Web Crawler
[Architecture diagram of a Web crawler]
From: http://en.wikipedia.org/wiki/Web_crawler
Structure of a Web Crawler
Behaviour is defined by policies:
- Selection policy
- Re-visit policy
- Politeness policy
- Parallelization policy
Selection Policy
- Even large search engines cover only 40-70% of the indexable web
- Need a metric of importance for prioritizing pages
- Page importance is a function of:
  - Intrinsic quality
  - Popularity of its links
Selection Policy
- Cho et al. used a 180,000-page data set from the stanford.edu domain
- Tested:
  - Breadth-first
  - Backlink-count
  - Partial PageRank (PPR) calculation
- Conclusion: for finding high-PageRank pages, use PPR
- Study: http://oak.cs.ucla.edu/~cho/papers/cho-thesis.pdf
Selection Policy
- Najork and Wiener crawled 328 million pages using breadth-first exploration
- Found that this strategy captures high-PageRank pages early
[Figure: breadth-first traversal of a small graph, nodes labelled 0-10]
The idea is to visit all nodes at distance 1 from the root node (labelled 0), followed by all nodes at distance 2, etc. Stop when we reach a certain distance (d_max) from the root node.
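To make the breadth-first policy concrete, here is a minimal Java sketch of a level-by-level crawl up to d_max. The fetchLinks helper is hypothetical; it stands in for downloading a page and extracting its out-links.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Breadth-first selection: visit all pages at distance 1 from the
// root, then distance 2, and so on, stopping at dMax.
public class BreadthFirstCrawler {
    // Hypothetical helper: download a page and return its out-links.
    static List<String> fetchLinks(String url) { /* ... */ return List.of(); }

    public static void crawl(String root, int dMax) {
        Set<String> seen = new HashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        frontier.add(root);
        seen.add(root);
        for (int depth = 0; depth < dMax && !frontier.isEmpty(); depth++) {
            Queue<String> next = new ArrayDeque<>();
            for (String url : frontier) {
                for (String link : fetchLinks(url)) {
                    if (seen.add(link)) {   // true only the first time we see it
                        next.add(link);
                    }
                }
            }
            frontier = next;                // advance one level of the BFS
        }
    }
}
```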
Backlink Crawler
- Backlink-count: this strategy crawls first the pages with the highest number of links pointing to them, so the next page to be crawled is the one most linked to from the pages already downloaded. This strategy was described by Cho et al. [CGMP98].
- See:
  - http://chato.cl/papers/crawling_thesis/scheduling.pdf
  - http://en.wikipedia.org/wiki/Focused_crawler
  - http://www10.org/cdrom/papers/208/ (Najork, Wiener)
Backlink Example
[Figure: a small link graph; among the uncrawled pages, one has 3 backlinks and another has 2 backlinks from pages already downloaded, so the page with 3 backlinks is crawled next]
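A minimal Java sketch of backlink-count scheduling as described above: each downloaded page credits one backlink to every uncrawled page it points to, and the uncrawled page with the most backlinks is crawled next. The class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Backlink-count scheduling: the priority of an uncrawled page is the
// number of already-downloaded pages that point to it.
public class BacklinkScheduler {
    private final Map<String, Integer> backlinks = new HashMap<>();
    private final Set<String> crawled = new HashSet<>();

    // Call after downloading a page, passing the out-links it contains.
    public void pageDownloaded(String url, List<String> outLinks) {
        crawled.add(url);
        backlinks.remove(url);
        for (String link : outLinks) {
            if (!crawled.contains(link)) {
                backlinks.merge(link, 1, Integer::sum);  // one more backlink
            }
        }
    }

    // The next page to crawl is the uncrawled page with the most backlinks.
    public String nextPage() {
        return backlinks.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
}
```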
Online Page Importance Computation (OPIC)
- This strategy is based on OPIC [APC03], which can be seen as a weighted backlink-count strategy.
- All pages start with the same amount of cash. Every time a page is crawled, its cash is split among the pages it links to. The priority of an uncrawled page is the sum of the cash it has received from the pages pointing to it.
- This strategy is similar to PageRank, but it has no random links and the calculation is not iterative, so it is much faster.
OPIC Example
[Figure: pages already downloaded (each holding OPIC cash n) split their cash among the pages they link to; one uncrawled page accumulates OPIC = 2.5 + 3 + 1 = 6.5, another OPIC = 1 + 2.5 = 3.5, so the first is crawled next]
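A minimal Java sketch of the OPIC strategy as described on the previous slide: cash is split evenly among out-links, and the highest-cash uncrawled page is selected next. The names are illustrative, and a page with no out-links simply retires its cash in this simplified version.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// OPIC scheduling: every page starts with the same cash; when a page
// is crawled its cash is split among the pages it links to, and an
// uncrawled page's priority is the cash it has accumulated.
public class OpicScheduler {
    private static final double INITIAL_CASH = 1.0;
    private final Map<String, Double> cash = new HashMap<>();

    // Called when a page is crawled: distribute its cash to its out-links.
    public void pageCrawled(String url, List<String> outLinks) {
        double c = cash.getOrDefault(url, INITIAL_CASH);
        cash.remove(url);
        if (outLinks.isEmpty()) return;     // sink page: cash is dropped here
        double share = c / outLinks.size();
        for (String link : outLinks) {
            cash.merge(link, share, Double::sum);
        }
    }

    // The next page to crawl is the one with the most accumulated cash.
    public String nextPage() {
        return cash.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
}
```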
Restricting Followed Links
- May want to follow only HTML pages
- May want to avoid specific MIME types
- May want to filter based on the URL; e.g., if there's a '?' in it then the page is probably dynamically generated (a filter sketch follows below)
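A sketch of such a link filter in Java. The extension list is an illustrative choice, not a standard; the '?' heuristic is the one mentioned above.

```java
import java.util.regex.Pattern;

// A simple URL-based link filter along the lines described above.
public class LinkFilter {
    // Common non-HTML extensions we may want to skip (illustrative list).
    private static final Pattern NON_HTML = Pattern.compile(
            ".*\\.(css|js|gif|jpe?g|png|pdf|zip|mp3|mp4)$",
            Pattern.CASE_INSENSITIVE);

    public static boolean shouldFollow(String url) {
        if (url.contains("?")) {
            return false;           // likely dynamically generated
        }
        return !NON_HTML.matcher(url).matches();
    }
}
```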
Re-visit Policy
- Typically, we are storing the pages that we visit (or creating a hash of them)
- We need to re-visit them
- Want to compute 2 measures:
  - Freshness: binary; whether the local copy is accurate or not
  - Age: indicates how outdated the local copy is
Age and Freshness
Freshness of a page $p$ at time $t$:
$$F_p(t) = \begin{cases} 1 & \text{if the local copy of } p \text{ is up to date at time } t \\ 0 & \text{otherwise} \end{cases}$$
Age of a page $p$ at time $t$:
$$A_p(t) = \begin{cases} 0 & \text{if } p \text{ is not modified at time } t \\ t - \text{modification time of } p & \text{otherwise} \end{cases}$$
($p$ is a page in the above equations.)
Re-visit Policy
- Uniform: re-visit with the same frequency regardless of rate of change
- Proportional: re-visit in proportion to the rate of change
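A small Java sketch contrasting the two policies, assuming we keep an estimated change rate per page; the base interval and the change-rate estimate are illustrative values.

```java
// Re-visit interval selection under the two policies above.
public class RevisitPolicy {
    // Uniform: the same interval for every page, regardless of how
    // often the page changes.
    public static double uniformIntervalDays(double baseIntervalDays) {
        return baseIntervalDays;
    }

    // Proportional: re-visit in proportion to the rate of change, so a
    // page that changes twice as often is re-visited twice as often.
    public static double proportionalIntervalDays(double baseIntervalDays,
                                                  double changesPerDay) {
        return baseIntervalDays / Math.max(changesPerDay, 1e-9);
    }
}
```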
Politeness Policy
- Web crawlers work faster than humans and can retrieve A LOT of data
- This has a significant performance impact on a site
- robots.txt defines a robot exclusion protocol: http://en.wikipedia.org/wiki/Robots_exclusion_standard
- Example (allow everybody):

  User-agent: *
  Disallow:

- Example (allow nobody):

  User-agent: *
  Disallow: /
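A minimal Java sketch of honouring Disallow rules for User-agent: *. A real crawler should use a full robots.txt parser; this only shows the core path-prefix check behind the exclusion protocol.

```java
import java.util.List;

// Holds the Disallow path prefixes parsed from a site's robots.txt.
public class RobotsRules {
    private final List<String> disallowed;  // e.g., ["/private/", "/tmp/"]

    public RobotsRules(List<String> disallowed) {
        this.disallowed = disallowed;
    }

    public boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return false;       // path falls under a Disallow rule
            }
        }
        return true;                // an empty Disallow list allows everybody
    }
}
```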
Crawl-delay
- A crawler shouldn't access pages as fast as it can!

  User-agent: *
  Crawl-delay: 10

  means: wait 10 seconds between requests to the same server.
- Web crawler delay strategies (sketched below):
  - Fixed: 15 seconds (WIRE)
  - Adaptive: if a page took t seconds to fetch, wait 10t before the next page (MERCATORWEB)
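A Java sketch of the two delay strategies, using the constants from this slide (15 s fixed; adaptive 10t); the fetch itself is elided.

```java
// Delay strategies between successive requests to the same server.
public class CrawlDelay {
    public static long fixedDelayMillis() {
        return 15_000;                    // WIRE-style fixed 15 s delay
    }

    // MERCATOR-style adaptive delay: if the last fetch took
    // fetchMillis, wait 10x that long before the next request.
    public static long adaptiveDelayMillis(long fetchMillis) {
        return 10 * fetchMillis;
    }

    public static void politeFetch(String url) throws InterruptedException {
        long start = System.currentTimeMillis();
        // ... fetch url here (elided) ...
        long took = System.currentTimeMillis() - start;
        Thread.sleep(adaptiveDelayMillis(took));
    }
}
```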
Delay Setting
It is a problem! Brin and Page note: "... running a crawler which connects to more than half a million servers (...) generates a fair amount of e-mail and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen."
Conclusion: use an adaptive delay with a lower bound.
Parallelization Policy
- Use multiple threads/processes to crawl in parallel
- Need to:
  - Dynamically assign URLs to different crawlers
  - Balance the load
  - Ensure that we don't access a URL more than once
  - Manage concurrency properly (i.e., serialize access to shared state); see the sketch after this list
- See: http://en.wikipedia.org/wiki/Distributed_web_crawling
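A Java sketch of a thread-safe shared frontier (the names are illustrative): an atomic check-and-add on the visited set guarantees each URL is fetched at most once even with many worker threads.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Shared frontier for parallel crawling: workers take URLs from the
// queue and offer back any new links they discover.
public class SharedFrontier {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    // Only enqueue a URL the first time it is discovered.
    public void offer(String url) {
        if (visited.add(url)) {    // atomic check-and-add
            queue.add(url);
        }
    }

    // Workers block here waiting for the next URL to crawl.
    public String take() throws InterruptedException {
        return queue.take();
    }
}
```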
Architectures for Parallel Crawlers
- Shared Space
- MapReduce
Shared Space
[Diagram: shared-space architecture for parallel crawlers]
MapReduce
[Diagram: MapReduce architecture for parallel crawlers]
Web Crawlers
- Many open-source crawlers in Java: http://java-source.net/open-source/crawlers
- Heritrix:
  - Used at Internet scale
  - Highly extensible
Web Crawler Software
- Crawler4j: Java; described in the following slides
- Nutch:
  - Java
  - Used in conjunction with Lucene (more later)
  - See: http://en.wikipedia.org/wiki/Nutch
  - Several search engines are built on it: Krugle (code search engine), DiscoverEd (open educational resources)
crawler4j
- Open-source web crawler
- Written in Java; simple and fast
- Found at: https://github.com/yasserg/crawler4j
- Requires that the user extend WebCrawler to implement a web crawler
  - See BasicCrawler and ImageCrawler for examples
- See the Configuration Details section at https://github.com/yasserg/crawler4j for configuration information
Crawler4j Details
- Have to implement 2 classes:
  - A controller that defines parameters for the crawl: seed URLs, storage folder, max pages crawled, ...
  - A class which extends the WebCrawler class
- Create an instance of CrawlConfig to modify the crawl parameters
  - (This should be converted to a parameter file!)
A controller sketch follows below.
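A minimal controller sketch along the lines of the crawler4j examples; the storage folder, page limit, and seed URL are placeholders, and exact class names can vary slightly between crawler4j versions. MyCrawler is the WebCrawler subclass sketched on the next slide.

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");   // placeholder folder
        config.setMaxPagesToFetch(1000);              // placeholder limit
        config.setPolitenessDelay(1000);              // 1 s between requests

        // The page fetcher and robots.txt handler are wired into the controller.
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://www.example.com/");  // placeholder seed URL

        int numberOfCrawlers = 4;                       // parallel crawler threads
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}
```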
WebCrawler: Important APIs
- boolean shouldVisit(Page page, WebURL url)
  - Determines whether a link should be followed
  - Usually base this on a Pattern
  - May also want to restrict to a domain
- void visit(Page page)
  - Called when a page is visited
  - Allows analysis of page contents
A subclass sketch follows below.
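A minimal WebCrawler subclass following the signatures on this slide (some crawler4j versions name the first shouldVisit parameter referringPage); the filter pattern and domain restriction are illustrative.

```java
import java.util.regex.Pattern;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {
    // Skip common binary/media extensions, as suggested above.
    private static final Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|pdf))$");

    @Override
    public boolean shouldVisit(Page page, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Follow the link only if it passes the filter and stays in our domain.
        return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.example.com/");  // placeholder domain
    }

    @Override
    public void visit(Page page) {
        // Called for each downloaded page: analyse its contents here.
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            System.out.println(page.getWebURL().getURL()
                    + " : " + html.getText().length() + " chars of text");
        }
    }
}
```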