Web Crawling: Advanced Methods of Information Retrieval. Gerhard Gossen, 2015-06-04.
Agenda
1. Web Crawling
2. How to crawl the Web
3. Challenges
4. Architecture of a Web Crawler
5. Special applications
Definition
The Web contains billions of interesting documents, but they are spread over millions of different machines. Before we can apply IR techniques to them, we need to systematically download them to our machines. The software that downloads such pages is called a Web crawler or spider (sometimes also bot).
Use cases
- Web search engines
- Web archives
- data mining
- Web monitoring
In this lecture
- technical specifications
- social protocols
- algorithms and data structures
2. How to crawl the Web
URLs
Each document is identified by a Uniform Resource Locator (URL), defined in RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax). Example:

  http://www.example.com/show?doc=123
  scheme://host name/path?query string

- Scheme: typically the access protocol (exceptions: mailto:, javascript:)
- Host name: domain name or IP address of the server hosting the document
- Path: file-system-like path to the document
- Query string (optional): additional data that can be interpreted by software running on the server
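These components can be recovered programmatically; a minimal sketch using Python's standard library (the example URL is the one from the slide):

```python
from urllib.parse import urlsplit

# Split the example URL into the components named above.
parts = urlsplit("http://www.example.com/show?doc=123")

print(parts.scheme)   # "http"
print(parts.netloc)   # "www.example.com"  (the host name)
print(parts.path)     # "/show"
print(parts.query)    # "doc=123"
```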
Hyperlinks
Web documents contain hyperlinks to other documents, referencing them by their URLs. In HTML:

  <a href="http://www.example.org/">example link</a>

which is displayed as: example link

All hyperlinks together form the link graph of the Web. Links are used (by humans and by crawlers) to discover new documents.
MIME types
MIME (Multipurpose Internet Mail Extensions) types give the technical format of a document (RFC 2046). Example:

  text/html;charset=utf-8

consisting of the type (text), the sub-type (html) and optional parameters (charset=utf-8). Originally specified for email, MIME types are now used in many Internet applications. IANA maintains the official registry at https://www.iana.org/assignments/media-types/media-types.xhtml.
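A MIME type string can be taken apart with plain string handling; a small sketch (the function name parse_mime is mine, and this simplified parser ignores quoting rules that full implementations must handle):

```python
def parse_mime(value):
    """Split a MIME type string into type, sub-type and parameters (sketch)."""
    media, _, raw_params = value.partition(";")
    main_type, _, sub_type = media.strip().partition("/")
    params = {}
    for item in raw_params.split(";"):
        if "=" in item:
            key, _, val = item.partition("=")
            params[key.strip()] = val.strip()
    return main_type, sub_type, params

print(parse_mime("text/html;charset=utf-8"))
# ('text', 'html', {'charset': 'utf-8'})
```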
HTTP
Most Web documents are accessed via the Hypertext Transfer Protocol (HTTP, RFC 7230-7235). HTTP is
- an application protocol that relies on TCP/IP for network communication
- a request-response protocol
- stateless
- plain-text based
HTTP example
Request:

  GET /index.html HTTP/1.1
  Host: www.example.com

Response:

  HTTP/1.1 200 OK
  Date: Mon, 01 Jun 2015 12:32:17 GMT
  Server: Apache
  Content-Type: text/html; charset=utf-8
  Content-Length: 138

  <html>...
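Because HTTP is plain-text based, a response like the one above can be decomposed with simple string operations; a sketch (parse_response is a hypothetical helper, and real responses need more care, e.g. repeated headers and binary bodies):

```python
def parse_response(raw):
    """Split a plain-text HTTP response into status, reason, headers, body (sketch)."""
    head, _, body = raw.partition("\r\n\r\n")         # empty line separates headers and body
    status_line, *header_lines = head.split("\r\n")
    version, status, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(": ")
        headers[name] = value
    return int(status), reason, headers, body

raw = ("HTTP/1.1 200 OK\r\n"
       "Content-Type: text/html; charset=utf-8\r\n"
       "Content-Length: 9\r\n"
       "\r\n"
       "<html>...")
status, reason, headers, body = parse_response(raw)
```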
HTTP request headers
Both parties can give additional information using request and response headers, respectively. Typical request headers:
- Accept-*: content types/languages/etc. that the client can handle
- Cookie: information from previous requests to the server, e.g. to simulate sessions
- Referer: URL of the document linking to the currently requested document
- User-Agent: identifier of the software making the request; should be descriptive and contain contact information (cf. http://commoncrawl.org/faqs/)
HTTP response headers
Typical response headers:
- Content-Type: MIME type of the content
- Last-Modified / ETag / Expires: caching
- Location: redirect to a different URL
- Set-Cookie: asks the client to send the specified information on subsequent requests
HTTP response status codes
The server signals success or the cause of a failure using a status code. Typical status codes:
- 1xx: informational
- 2xx: success (e.g. 200 OK)
- 3xx: redirection (permanent/temporary)
- 4xx: client error (e.g. 403 Forbidden, 404 Not Found)
- 5xx: server error
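A crawler mostly cares about the status class, not the individual code; one possible mapping to crawler actions (the function and the action names are mine, not a fixed standard):

```python
def classify_status(code):
    """Map an HTTP status code to a coarse crawler action (illustrative)."""
    if 200 <= code < 300:
        return "store"      # success: keep the document
    if 300 <= code < 400:
        return "redirect"   # follow the Location header
    if 400 <= code < 500:
        return "drop"       # client error: give up on this URL
    if 500 <= code < 600:
        return "retry"      # server error: try again later
    return "ignore"         # 1xx and anything unexpected
```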
Crawler algorithm

  from collections import deque

  def crawler(initial_urls):
      queue = deque(initial_urls)
      while queue:
          url = queue.popleft()
          doc = download(url)
          out_urls = extract_links(doc)
          queue.extend(out_urls)
          store(doc)

(download, extract_links and store are placeholders for the actual fetching, parsing and storage components.) The initial URLs of a crawl are also called seed URLs.
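A runnable sketch of this loop, extended with a seen-set so that link cycles do not loop forever; download and extract_links are stand-ins supplied by the caller, and the toy link graph below is made up for illustration:

```python
from collections import deque

def crawl(seed_urls, download, extract_links):
    """Breadth-first crawl of the graph reachable from the seed URLs."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    docs = {}
    while queue:
        url = queue.popleft()
        doc = download(url)
        docs[url] = doc                  # "store(doc)"
        for out_url in extract_links(doc):
            if out_url not in seen:      # skip already-queued URLs
                seen.add(out_url)
                queue.append(out_url)
    return docs

# Toy Web: pages and their out-links, with a cycle between /a and /b.
links = {"/a": ["/b"], "/b": ["/a", "/c"], "/c": []}
docs = crawl(["/a"], download=lambda u: u, extract_links=lambda d: links[d])
```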
3. Challenges
Scalability
The crawler fetches documents across the network from often busy or slow servers (up to several seconds per document), so a simple sequential crawler waits most of the time. Parallelization can speed up the crawler enormously (easily hundreds of parallel fetches per machine). Large Web crawls can max out the processing, storage, or network capacity of a single machine; then the crawler must be distributed across a cluster of crawling machines.
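Since fetching is I/O-bound, overlapping many requests is the standard way to hide network latency; a sketch with a thread pool (the fetch function here is a stand-in that returns fake content instead of doing network I/O):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for a network download; a real fetcher would block on I/O here.
    return f"<html>content of {url}</html>"

urls = [f"http://example.com/doc{i}" for i in range(100)]

# Because fetches are I/O-bound, a thread pool lets many requests overlap.
with ThreadPoolExecutor(max_workers=20) as pool:
    docs = list(pool.map(fetch, urls))
```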
Politeness
Powerful crawlers can overload and take down smaller servers. This is
1. very impolite towards server operators, and
2. often leads to blocks against the crawler.
Countermeasure: wait some time between requests to the same server (typically 0.5 to 5 seconds).
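Enforcing this only needs a per-host record of the last request time; a minimal sketch (the class name and interface are mine):

```python
import time

class PolitenessTracker:
    """Track per-host last-request times and enforce a minimum delay (sketch)."""
    def __init__(self, delay=1.0):
        self.delay = delay
        self.last_request = {}   # host -> timestamp of last request

    def wait_time(self, host, now=None):
        """Seconds the crawler must still wait before contacting `host`."""
        now = time.monotonic() if now is None else now
        last = self.last_request.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.delay - (now - last))

    def record(self, host, now=None):
        """Remember that a request to `host` was just made."""
        self.last_request[host] = time.monotonic() if now is None else now
```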
Robots Exclusion Protocol (robots.txt)
An informal protocol: server operators can create a file called robots.txt that tells bot operators about acceptable behavior. Its main function is to allow/disallow downloading of parts of the site. The protocol relies on the cooperation of bot operators.

Example (googlebot may fetch everything, all other bots nothing):

  User-agent: googlebot
  Disallow:

  User-agent: *
  Disallow: /
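Python ships a robots.txt parser in the standard library; a sketch applying it to the example above (the URL is illustrative):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("googlebot", "http://www.example.com/page"))   # True
print(rp.can_fetch("mycrawler", "http://www.example.com/page"))   # False
```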
Robustness
A crawler can encounter many different types of errors:
- Hardware: failures, full disks
- Network: temporary failures, overloaded services (e.g. DNS)
- Servers: slow to respond, sending invalid responses
- Content: malformed or excessively large documents
Additionally, server operators can accidentally or deliberately cause unwanted crawler behavior:
- Crawler traps: a site has a huge number of irrelevant pages (e.g. auto-generated calendars)
- Link farms: spammers create huge networks of interlinked Web sites
The crawler should be able to cope with all of these problems.
Prioritization
The crawler can have different purposes (often in combination):
- download a large number of documents
- download the most relevant documents
- download the most important documents
- keep the document collection up to date
- ...
Often there is also a time constraint, e.g. update the search index once per week. This implies a fixed budget of downloads, so the crawler needs to select the URLs with the highest priority.
Conflicts between challenges
Crawler design and operation need to balance conflicting challenges:
- politeness vs. scalability
- politeness vs. download speed
- completeness of the crawl vs. obeying robots.txt
Some of these conflicts can be handled by an appropriate crawler architecture; others need human intervention and trade-off decisions.
4. Architecture of a Web Crawler
Architecture
(Figure: a crawl is executed by one or more parallel crawl processes. Each process contains a DNS resolver & cache (mapping host names to IP addresses), an HTTP fetcher, a link extractor, a URL distributor, a URL filter, a duplicate URL eliminator, a URL prioritizer, and a queue. The URL distributors exchange URLs between the crawl processes.)
Crawler components
- HTTP fetcher: fetches and caches robots.txt, then fetches the actual documents
- DNS resolver and cache: resolves host names to IP addresses; the locality of URLs enables efficient caching
- Link extractor: parses the page's HTML content and extracts links
- URL distributor: distributes URLs to the responsible crawl process
- URL filter: removes unwanted URLs (spam, irrelevant, ...), normalizes the remaining URLs (removes session IDs, ...)
- Duplicate URL eliminator: removes already fetched URLs (also called the URL-seen test)
- URL prioritizer: assigns an expected relevance to each URL (e.g. using PageRank, update frequency, or similar measures)
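A sketch of the URL filter's normalization step: lowercase the scheme and host and drop session-ID query parameters so that equivalent URLs compare equal for the URL-seen test (the parameter blocklist is a made-up example; real filters use larger, site-specific lists):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAMS = {"sessionid", "sid", "phpsessid"}   # hypothetical blocklist

def normalize(url):
    """Normalize a URL: lowercase scheme/host, drop session-ID parameters (sketch)."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", urlencode(query), ""))

print(normalize("HTTP://WWW.Example.COM/show?doc=123&sessionid=abc"))
# http://www.example.com/show?doc=123
```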
Queue / Frontier
The data structure that stores unfetched URLs. Requirements:
- ordered (a queue)
- concurrent
- scalable to sizes larger than memory
- needs to ensure politeness
Simplest queue implementation: FIFO
A first-in-first-out (FIFO) queue is the simplest implementation: store the URLs in a list, dequeue the first item for the fetcher, and append new links at the end. Drawbacks:
- no support for prioritization
- politeness limits throughput: documents typically contain many links to the same host, so the queue has long stretches of URLs from the same host, and the fetchers must wait at each step until the politeness interval is over.
A possible solution: per-host queues.
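The per-host idea can be sketched as one FIFO queue per host, cycled round-robin so that consecutive dequeues come from different hosts (a simplified illustration, without the priority machinery of the next slides):

```python
from collections import deque
from urllib.parse import urlsplit

class HostQueues:
    """One FIFO queue per host, cycled round-robin (sketch of the idea)."""
    def __init__(self):
        self.queues = {}       # host -> deque of URLs
        self.hosts = deque()   # round-robin order of non-empty hosts

    def push(self, url):
        host = urlsplit(url).netloc
        if host not in self.queues:
            self.queues[host] = deque()
            self.hosts.append(host)
        self.queues[host].append(url)

    def pop(self):
        """Return a URL from the next host in round-robin order."""
        host = self.hosts.popleft()
        url = self.queues[host].popleft()
        if self.queues[host]:
            self.hosts.append(host)   # host still has URLs: re-queue it
        else:
            del self.queues[host]
        return url
```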
Per-host queues: Mercator
(Figure: the Mercator frontier. New URLs pass through the URL prioritizer into front queues f_1, ..., f_n; the front queue selector moves them into per-host back queues b_1, ..., b_m, which are scheduled via the back queue priority queue and the back queue selector.)
Mercator queue components
- URL prioritizer: assigns a priority in [1, n] to each new URL
- Front queues: FIFO queues containing the URLs with (discrete) priority i
- Front queue selector: picks a front queue randomly, but biased towards queues with higher priority
- Back queues: a fixed number of non-empty FIFO queues, each containing only URLs from the same host
- Back queue priority queue: contains (back queue, t) tuples ordered ascending by t, where t is the next allowed crawl time for the corresponding back queue
- Host table: mapping from server name to assigned back queue
Parts of the queue may be stored on disk and only brought into memory when needed.
Mercator queue algorithm
Enqueue: the URL is added to the front queue corresponding to its priority.
Dequeue:
1. Dequeue the first element (bq, t) from the back queue priority queue and wait until time t.
2. Dequeue the first URL from bq and download it.
3. If bq is now empty: dequeue the next URL u from the front queues; if the corresponding back queue exists (cf. host table), insert u there and repeat; otherwise add u to bq and update the host table.
4. Enqueue (bq, t_now + Δ).
In Mercator, Δ is the download time multiplied by a politeness parameter (typically 10), which implicitly prefers faster servers.
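A simplified, single-process sketch of this frontier. It deviates from Mercator in ways worth flagging: front queues are drained deterministically highest-priority-first instead of by biased random selection, back queues are created per host on demand rather than being a fixed number m, and Δ is a fixed delay instead of download time times 10:

```python
import heapq
from collections import deque
from urllib.parse import urlsplit

class MercatorQueue:
    """Simplified sketch of the Mercator frontier (single process, in memory)."""
    def __init__(self, num_priorities=3, delay=1.0):
        self.front = [deque() for _ in range(num_priorities)]
        self.back = {}           # host -> deque of URLs (the back queues)
        self.heap = []           # (next allowed crawl time, host)
        self.next_allowed = {}   # host -> earliest next crawl time
        self.delay = delay       # fixed politeness delay (stands in for Mercator's Δ)

    def enqueue(self, url, priority=0):
        self.front[priority].append(url)

    def _refill(self, now):
        # Move URLs from the front queues into back queues until a new host appears.
        for queue in self.front:                      # highest priority first
            while queue:
                url = queue.popleft()
                host = urlsplit(url).netloc
                if host in self.back:
                    self.back[host].append(url)       # host already has a back queue
                else:
                    self.back[host] = deque([url])
                    heapq.heappush(self.heap, (self.next_allowed.get(host, now), host))
                    return

    def dequeue(self, now):
        """Return (wait_time, url) for the next URL to crawl, or None if empty."""
        if not self.heap:
            self._refill(now)
        if not self.heap:
            return None
        t, host = heapq.heappop(self.heap)
        url = self.back[host].popleft()
        self.next_allowed[host] = max(t, now) + self.delay
        if self.back[host]:
            heapq.heappush(self.heap, (self.next_allowed[host], host))
        else:
            del self.back[host]
            self._refill(now)
        return max(0.0, t - now), url
```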
5. Special applications
Continuous crawling
Changes in the Web:
- changing content of already crawled pages
- appearance / disappearance of pages
- appearance / disappearance of links between pages
In addition, the users of the IR system change their interests.
Content changes
Individual pages change their content often: more than 40% change at least daily [CG00]. But there is no overall pattern; change occurs at different frequencies and time scales (seconds to years).
(Figure: change rate of Web pages)
The change frequency can be modelled as a Poisson process. With X(t) the number of changes in (0, t] and λ the change rate, for k = 0, 1, ...:

  Pr{X(s + t) − X(s) = k} = (λt)^k e^(−λt) / k!
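The formula above can be evaluated directly; a small sketch (the function name and the example rate are mine):

```python
from math import exp, factorial

def pr_changes(k, rate, t):
    """Probability that a page with Poisson change rate `rate`
    changes exactly k times within a window of length t."""
    return (rate * t) ** k * exp(-rate * t) / factorial(k)

# A page that changes on average once per day (rate = 1/day):
print(pr_changes(0, 1.0, 1.0))   # chance of no change within one day, e^-1
```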
Change types [OP08]
The temporal behavior of pages (or page regions) can be classified as:
- static: no changes
- churn: new content supplants old content (e.g. quote of the day)
- scroll: new content is appended to old content (e.g. blog entries)
The Web changes [NCO04; Das+07]
- New pages are created at a rate of 8% per week.
- During one year, 80% of pages disappear.
- New links are created at a rate of 25% per week, significantly faster than the rate of new page creation.
- Links are retired at about the same pace as pages.
Users change
User interests change, and the goals of the IR system maintainer change. Both require adaptation of the crawl strategy.
Re-crawling strategies
Crawlers need to balance different considerations:
- Coverage: fetch new pages
- Freshness: find updates of existing pages
(Figure: crawl ordering model [ON10])
Basic strategies
Batch crawling: the crawl process is stopped and restarted periodically; each document is crawled only once per crawl.
Incremental crawling: crawling runs continuously; a document can be crawled multiple times during a crawl, and the crawl frequency can differ between sites.
Batch crawling is easier to implement; incremental crawling is more powerful.
Batch crawling strategies
The goal is to maximize the weighted coverage

  WC(t) = Σ_{p ∈ C(t)} w(p)

with t the time since the start of the crawl, C(t) the pages crawled until time t, and w(p) the weight of page p (0 ≤ w(p) ≤ 1). The main strategy types, ordered by complexity:
- breadth-first search
- order by in-degree
- order by PageRank
(Figure: weighted coverage as a function of time t)
Incremental crawling strategies
The goal is to maximize the weighted freshness

  WF(t) = Σ_{p ∈ C(t)} w(p) · f(p, t)

with f(p, t) the freshness level of page p at time t. The steady-state average of WF is

  WF_avg = lim_{t→∞} (1/t) ∫_0^t WF(τ) dτ

The trade-off between coverage and freshness is often treated as a business decision and needs to be tuned towards the goals of the specific application.
Maximizing freshness [CG03]
1. Model estimation: create a temporal model for each page p.
2. Resource allocation: given a maximum crawl rate r, decide on a revisitation frequency r(p) for each page.
3. Scheduling: produce a crawl order that implements the targeted revisitation frequencies as closely as possible.
Model estimation
Create a model of the temporal behavior of p, given samples of past content of p or of pages similar to p. The samples are often not evenly spaced. The content can give hints about the change frequency: HTTP headers, number of links, depth of the page in the site. Similar pages tend to have similar behavior: same site, similar content, similar link structure.
Resource allocation
Binary freshness model:

  f(p, t) = 1 if the old copy is equal to the live copy, 0 otherwise

An intuitively good strategy is proportional resource allocation: assign a revisitation frequency proportional to the change frequency. But uniform resource allocation achieves better average binary freshness (assuming equal page weights). The reason: pages with a high change frequency (A) are stale very often regardless of the crawl frequency, while pages with a lower change frequency (B) can be kept fresh more easily. It is better to keep several pages of type B fresh than to waste resources on A.
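This effect can be checked numerically. Under the Poisson change model, the time-averaged binary freshness of a page with change rate λ that is re-crawled every I time units is (1 − e^(−λI)) / (λI) (averaging e^(−λt) over one revisit interval). A sketch with made-up rates (9 and 1 changes per day, 10 crawls per day in total):

```python
from math import exp

def avg_freshness(rate, interval):
    """Time-averaged binary freshness under the Poisson change model."""
    x = rate * interval
    return (1.0 - exp(-x)) / x

def mean_freshness(rates, freqs):
    """Average freshness over pages, each revisited with its own frequency."""
    return sum(avg_freshness(r, 1.0 / f) for r, f in zip(rates, freqs)) / len(rates)

# Hypothetical example: page A changes 9x/day, page B 1x/day; budget 10 crawls/day.
rates, budget = [9.0, 1.0], 10.0

proportional = mean_freshness(rates, [budget * r / sum(rates) for r in rates])
uniform = mean_freshness(rates, [budget / len(rates)] * len(rates))
# uniform (~0.69) beats proportional (~0.63): crawls spent on the fast-changing
# page A are largely wasted, while page B is easy to keep fresh.
```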
Resource allocation (continued)
Continuous freshness model:

  age(p, t) = 0 if the old copy is equal to the live copy, a otherwise

where a is the amount of time between the cached and the live copy. Under this model the optimal revisitation frequency increases with the change frequency: because age increases monotonically, the crawler cannot give up on a page. Instead of age, the crawler can also consider content changes directly and distinguish between long-lived and ephemeral content.
Scheduling
Goal: produce a crawl ordering that implements the targeted revisitation frequencies as closely as possible. Uniform spacing of the downloads of p achieves the best results.
Focused crawling
Crawling the whole Web is very expensive, but often only documents about specific topics are relevant. The basic assumption is link homogeneity: pages preferentially link to similar pages, so links from relevant pages typically lead to more relevant pages. Example applications: vertical search engines (e.g. hotels, jobs), data mining, Web archives.
Deep Web crawling / JavaScript crawling
So far we only crawl by following links. But not all content is accessible through links: content behind Web forms and inside JavaScript applications is not.
Deep Web crawling
Deep Web crawling accesses content that is not reachable by links, but only by filling in HTML forms. Approach:
1. Locate deep Web content sources, e.g. by focused crawling.
2. Select relevant sources, e.g. by expected coverage or by reputation.
3. Extract the underlying content.
Content extraction:
1. Select the relevant form fields (e.g. exclude sort order and other presentational fields).
2. Detect the role of the targeted fields (data type, appropriate values).
3. Create a database of values, e.g. manually or from Web sources.
4. Issue queries, extract the content, and extend the values database.
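Step 4 amounts to enumerating combinations of field values; a toy sketch (the form fields and values are entirely made up, and a real system would submit each query as an HTTP request and parse the result pages):

```python
from itertools import product

# Hypothetical value database for the relevant fields of a job-search form.
values = {"location": ["Hannover", "Berlin"], "category": ["IT", "Sales"]}

# One query per combination of field values (the cross product of the database).
queries = [dict(zip(values, combo)) for combo in product(*values.values())]
# 4 queries: every (location, category) pair
```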
JavaScript crawling
Originally, Web pages were created on the server and did not change once delivered to the client. JavaScript allows pages to change their appearance, add interactive features, and load and replace content. Crawlers typically do not execute JavaScript and can therefore miss such content.
JavaScript crawling
Crawling can be modeled by states (distinct pages) and transitions (ways to move from one state to another):
- Web crawling: state = URL, transition = following a link
- JavaScript crawling: state = DOM representation, transition = potentially any user interaction
JavaScript crawling strategy
1. Load the page.
2. For each potential action (click, scroll, hover, ...):
   - execute the action
   - wait until JavaScript and possibly AJAX requests have finished
   - compare the DOM to the original state
   - if the DOM has changed, update the state model and continue recursively
   - reset to the original state
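Abstracting away the browser, this strategy is a graph exploration over DOM states; a sketch with a simulated page (the explore function, the toy transition table and the action names are all mine; a real implementation would drive a browser and diff actual DOM trees):

```python
def explore(initial_state, actions, apply_action):
    """Explore the state graph of a scripted page (sketch).
    `apply_action(state, action)` returns the DOM state after the action;
    resetting is simulated by restarting from a remembered state."""
    seen = {initial_state}
    edges = []
    stack = [initial_state]
    while stack:
        state = stack.pop()
        for action in actions:
            new_state = apply_action(state, action)   # execute action, wait for AJAX
            if new_state != state and new_state not in seen:
                seen.add(new_state)                   # DOM changed: new state found
                edges.append((state, action, new_state))
                stack.append(new_state)
    return seen, edges

# Toy page model: clicking "more" twice appends content; "menu" changes nothing.
transitions = {("p0", "more"): "p1", ("p1", "more"): "p2"}
apply = lambda s, a: transitions.get((s, a), s)
states, edges = explore("p0", ["more", "menu"], apply)
```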
Challenges in JavaScript crawling
- higher computational cost
- security
- asynchronous model: JavaScript can execute actions in the background and may change the document in unexpected ways
- detection of relevant changes (ignore e.g. changed advertisements)
- stopping criteria for the state model exploration
What have we discussed today?
- prerequisites for Web crawling
- a general model of a crawler
- implementation considerations
- typical challenges
- special applications
Further reading
- Christopher Olston and Marc Najork. "Web Crawling". Foundations and Trends in Information Retrieval 4.3 (2010), pp. 175-246. DOI: 10.1561/1500000017. (General introduction.)
- Suryakant Choudhary et al. "Crawling Rich Internet Applications: The State of the Art". CASCON '12, 2012, pp. 146-160. (JavaScript crawling.)
- Carlos Castillo and Ricardo Baeza-Yates. "Practical Web Crawling". Technical report, University of Chile, 2005. http://chato.cl/papers/castillo_05_practical_web_crawling.pdf. (Detailed implementation advice.)
References
- Carlos Castillo and Ricardo Baeza-Yates. "Practical Web Crawling". Technical report, University of Chile, 2005. http://chato.cl/papers/castillo_05_practical_web_crawling.pdf
- [CG00] Junghoo Cho and Hector Garcia-Molina. "The Evolution of the Web and Implications for an Incremental Crawler". VLDB 2000, pp. 200-209.
- [CG03] Junghoo Cho and Hector Garcia-Molina. "Effective Page Refresh Policies for Web Crawlers". ACM Transactions on Database Systems 28 (2003).
- Suryakant Choudhary et al. "Crawling Rich Internet Applications: The State of the Art". CASCON '12, 2012, pp. 146-160.
- [Das+07] Anirban Dasgupta et al. "The Discoverability of the Web". WWW '07, 2007. DOI: 10.1145/1242572.1242630.
- [NCO04] Alexandros Ntoulas, Junghoo Cho, and Christopher Olston. "What's New on the Web? The Evolution of the Web from a Search Engine Perspective". WWW '04, 2004. DOI: 10.1145/988672.988674.
- [ON10] Christopher Olston and Marc Najork. "Web Crawling". Foundations and Trends in Information Retrieval 4.3 (2010), pp. 175-246. DOI: 10.1561/1500000017.
- [OP08] Christopher Olston and Sandeep Pandey. "Recrawl Scheduling Based on Information Longevity". WWW '08, ACM, 2008, pp. 437-446. DOI: 10.1145/1367497.1367557.