
Web Crawling
Advanced methods of Information Retrieval
Gerhard Gossen, 2015-06-04

Agenda
1 Web Crawling
2 How to crawl the Web
3 Challenges
4 Architecture of a Web Crawler
5 Special applications

Definition
The Web contains billions of interesting documents, but they are spread over millions of different machines. Before we can apply IR techniques to them, we need to systematically download them to our own machines. The software that downloads such pages is called a Web crawler or spider (sometimes also bot).

Use cases
- Web search engines
- Web archives
- Data mining
- Web monitoring

In this lecture
- Technical specifications
- Social protocols
- Algorithms and data structures

2 How to crawl the Web

URLs
Each document is identified by a Uniform Resource Locator (URL), defined in RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax).
Example: http://www.example.com/show?doc=123
- Scheme (http): typically the access protocol (exceptions: mailto:, javascript:)
- Host name (www.example.com): domain name or IP address of the server hosting the document
- Path (/show): file-system-like path to the document
- Query string (?doc=123, optional): additional data, can be interpreted by software running on the server
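
These components can be inspected with Python's standard urllib.parse module; a minimal sketch using the example URL above:

    from urllib.parse import urlsplit, parse_qs

    parts = urlsplit("http://www.example.com/show?doc=123")
    print(parts.scheme)            # 'http'
    print(parts.hostname)          # 'www.example.com'
    print(parts.path)              # '/show'
    print(parse_qs(parts.query))   # {'doc': ['123']}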

Hyperlinks
Web documents contain hyperlinks to other documents using their URLs. In HTML:
<a href="http://www.example.org/">Example link</a>
which is displayed as: Example link
All hyperlinks together form the link graph of the Web. Links are used (by humans and crawlers) to discover new documents.
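
As a sketch of how a crawler can extract such links, Python's standard html.parser can collect href attributes and resolve them against the page URL; the class and function names here are mine, not from the slides:

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        # Resolve relative links against the page URL
                        self.links.append(urljoin(self.base_url, value))

    def extract_links(html, base_url):
        parser = LinkExtractor(base_url)
        parser.feed(html)
        return parser.links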

MIME types
MIME (Multipurpose Internet Mail Extensions) types (RFC 2046) give the technical format of a document.
Example: text/html;charset=utf-8 (type/sub-type, followed by optional parameters)
Originally specified for e-mail, now used in many Internet applications. IANA maintains the official registry: https://www.iana.org/assignments/media-types/media-types.xhtml

HTTP
Most Web documents are accessed via the Hypertext Transfer Protocol (HTTP, RFC 7230-7235). HTTP is
- an application protocol, relies on TCP/IP for network communications
- a request-response protocol
- stateless
- plain text based

HTTP example
Request:
GET /index.html HTTP/1.1
Host: www.example.com

Response:
HTTP/1.1 200 OK
Date: Mon, 01 Jun 2015 12:32:17 GMT
Server: Apache
Content-Type: text/html; charset=utf-8
Content-Length: 138

<html>...
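
The same exchange can be reproduced from Python with the standard http.client module; a minimal sketch using the host and path from the example above:

    import http.client

    conn = http.client.HTTPConnection("www.example.com")
    conn.request("GET", "/index.html")
    response = conn.getresponse()
    print(response.status, response.reason)        # e.g. 200 OK
    print(response.getheader("Content-Type"))      # e.g. text/html; charset=UTF-8
    body = response.read()                          # the HTML document
    conn.close()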

HTTP request headers
Both parties can give additional information using request and response headers, respectively. Typical request headers:
- Accept-*: content types/languages/etc. the client can handle
- Cookie: information from previous requests to the server, e.g. to simulate sessions
- Referer: URL of the document linking to the currently requested document
- User-Agent: identifier of the software making the request; should be descriptive and contain contact information (cf. http://commoncrawl.org/faqs/)

HTTP response headers
Typical response headers:
- Content-Type: MIME type of the content
- Last-Modified / ETag / Expires: caching information
- Location: redirect to a different URL
- Set-Cookie: asks the client to send the specified information on subsequent requests

HTTP response status codes
The server signals success or the cause of a failure using a status code. Typical status codes:
- 1xx Informational
- 2xx Success (200 OK)
- 3xx Redirection (permanent/temporary)
- 4xx Client error (403 Forbidden, 404 Not Found)
- 5xx Server error

Crawler algorithm

    def crawler(initial_urls):
        queue = list(initial_urls)
        while queue:
            url = queue.pop(0)            # take the oldest unfetched URL (FIFO)
            doc = download(url)
            out_urls = extract_links(doc)
            queue.extend(out_urls)        # add newly discovered URLs to the queue
            store(doc)

The initial URLs of a crawl are also called seed URLs.

3 Challenges

Scalability
The crawler fetches documents across the network from often busy or slow servers (up to several seconds per document):
- a simple crawler waits most of the time
- parallelization can speed up the crawler enormously (easily hundreds of parallel fetches per machine)
Large Web crawls can max out the processing, storage, or network capacity of a single machine: distribute the crawler across a cluster of crawling machines.

Politeness
Powerful crawlers can overload and take down smaller servers. This is (1) very impolite towards server operators and (2) often leads to blocks against the crawler.
Countermeasure: wait some time between requests to the same server (typically 0.5-5 s).
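
A minimal sketch of such a per-host delay for a single-threaded crawler; the delay value and names are illustrative:

    import time
    from urllib.parse import urlsplit

    POLITENESS_DELAY = 2.0   # seconds between requests to the same host (assumed value)
    last_access = {}         # host -> time of the last request

    def polite_wait(url):
        host = urlsplit(url).hostname
        wait = last_access.get(host, 0) + POLITENESS_DELAY - time.time()
        if wait > 0:
            time.sleep(wait)
        last_access[host] = time.time()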

Robots Exclusion Protocol (robots.txt)
Informal protocol: server operators can create a file called robots.txt that tells bot operators about acceptable behavior. Main function: allow/disallow downloading of parts of the site. Relies on the cooperation of bot operators.
Example:
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /
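
Python ships a parser for this protocol in urllib.robotparser; a minimal sketch of checking a URL before fetching it (the user-agent string and URLs are placeholders):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()   # fetches and parses the robots.txt file

    if rp.can_fetch("MyCrawler/1.0", "http://www.example.com/show?doc=123"):
        pass    # allowed: download the page
    else:
        pass    # disallowed: skip the URL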

Robustness
A crawler can encounter many different types of errors:
- Hardware: failures, full disks
- Network: temporary failures, overloaded services (e.g. DNS)
- Servers: slow to respond, send invalid responses
- Content: malformed documents, excessive size
Additionally, server operators can accidentally or on purpose cause unwanted crawler behavior:
- Crawler traps: a site has a huge number of irrelevant pages (e.g. calendars)
- Link farms: spammers create huge networks of interlinked Web sites
The crawler should be able to cope with these problems.

Prioritization
The crawler can have different purposes (often in combination):
- download a large number of documents
- download the most relevant documents
- download the most important documents
- keep the document collection up to date
- ...
Often there is also a time constraint, e.g. update the search index once per week. With a fixed budget of downloads, the crawler needs to select the URLs with the highest priority.

Conflicts between challenges
Crawler design and operation need to balance conflicting goals:
- politeness vs. scalability
- politeness vs. download speed
- completeness of the crawl vs. obeying robots.txt
Some of these can be handled by an appropriate crawler architecture, others need human intervention and trade-off decisions.

4 Architecture of a Web Crawler

Architecture
Figure: architecture of a distributed crawler. Each crawl process contains a DNS resolver & cache (mapping host names to IP addresses), an HTTP fetcher, a link extractor, a URL distributor, a URL filter, a duplicate URL eliminator, a URL prioritizer, and a queue; the URL distributors exchange URLs between the crawl processes.

Crawler components
- HTTP fetcher: fetches and caches robots.txt, then fetches the actual documents
- DNS resolver and cache: resolves host names to IP addresses; locality of URLs enables efficient caching
- Link extractor: parses the page's HTML content and extracts links
- URL distributor: distributes URLs to the responsible crawler process
- URL filter: removes unwanted URLs (spam, irrelevant, ...), normalizes the remaining URLs (removes session IDs, ...)
- Duplicate URL eliminator: removes already fetched URLs (also called the URL-seen test)
- URL prioritizer: assigns an expected relevance to the URL (e.g. using PageRank, update frequency, or similar measures)
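
A sketch of the URL filter and URL-seen test, assuming a simple in-memory set (real crawlers use disk-backed or compact hashed structures); the normalization rules shown are only illustrative:

    from urllib.parse import urlsplit, urlunsplit

    seen_urls = set()

    def normalize(url):
        parts = urlsplit(url)
        # Illustrative normalization: lower-case the host, drop the fragment
        return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, parts.query, ""))

    def url_seen_test(url):
        """Return True if the URL is new and should be queued."""
        url = normalize(url)
        if url in seen_urls:
            return False
        seen_urls.add(url)
        return True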

Queue / Frontier
The data structure that stores unfetched URLs. Requirements:
- ordered (queue)
- concurrent
- scalable to sizes larger than memory
- needs to ensure politeness

Simplest queue implementation: FIFO
A first-in-first-out (FIFO) queue is the simplest implementation:
- store URLs in a list
- dequeue the first item for the fetcher
- append new links at the end
Drawbacks:
- no support for prioritization
- politeness limits throughput: documents typically have many links to the same host, so the queue has long stretches of URLs from the same host, and fetchers need to wait at each step until the politeness interval is over
Possible solution: per-host queues.

Per-host queues: Mercator
Figure: the Mercator frontier. A URL prioritizer feeds new URLs into front queues f1 ... fn (one per priority level); a front queue selector moves them into back queues b1 ... bm (one per host); a back queue selector picks the next back queue from a priority queue ordered by the next allowed crawl time.

Mercator queue components
- Back queues: a fixed number of non-empty FIFO queues, each containing only URLs from the same host
- Back queue priority queue: contains (back queue, t) tuples ordered ascending by t, where t is the next allowed crawl time for the corresponding back queue
- Front queues: FIFO queues containing URLs with (discrete) priority i
- Front queue selector: picks a front queue randomly, but biased towards queues with higher priority
- URL prioritizer: assigns a priority in [1, n] to each new URL
- Host table: mapping from server name to assigned back queue
Parts of the queue may be stored on disk and only brought into memory when needed.

Mercator queue algorithm
Enqueue: URLs are added to the front queue corresponding to their priority.
Dequeue:
1. dequeue the first element (bq, t) from the back queue priority queue and wait until time t
2. dequeue the first URL from bq and download it
3. if bq is now empty:
   - dequeue the next URL u from the front queues
   - if a corresponding back queue exists (cf. host table), insert u there and repeat
   - otherwise: add URL u to bq and update the host table
4. enqueue (bq, t_now + Δ)
In Mercator, Δ is the download time multiplied by a politeness parameter (typically 10), so that faster servers are preferred.
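
A condensed sketch of this frontier in Python, using heapq for the back queue priority queue. It follows the structure described above but is a simplified, single-threaded sketch (no disk spilling, no biased front queue selection); the class and method names are mine, not from Mercator:

    import heapq
    import time
    from collections import deque
    from urllib.parse import urlsplit

    class MercatorFrontier:
        def __init__(self, num_priorities=3, politeness=10.0):
            self.front = [deque() for _ in range(num_priorities)]  # front queues, one per priority
            self.back = {}     # host table: host -> back queue (deque of URLs for that host)
            self.heap = []     # back queue priority queue: (next allowed crawl time, host)
            self.politeness = politeness

        def prioritize(self, url):
            return 0           # placeholder: assign a priority in [0, num_priorities)

        def enqueue(self, url):
            self.front[self.prioritize(url)].append(url)

        def _refill(self):
            # Move URLs out of the front queues until some new host gets a back queue.
            while any(self.front):
                url = next(q for q in self.front if q).popleft()
                host = urlsplit(url).hostname
                if host in self.back:
                    self.back[host].append(url)      # host already has a back queue
                else:
                    self.back[host] = deque([url])
                    heapq.heappush(self.heap, (time.time(), host))
                    return

        def dequeue(self):
            # Returns (host, url); the caller reports the download time via finished().
            if not self.heap:
                self._refill()
            if not self.heap:
                return None
            next_time, host = heapq.heappop(self.heap)
            delay = next_time - time.time()
            if delay > 0:
                time.sleep(delay)                    # wait until the host may be contacted again
            return host, self.back[host].popleft()

        def finished(self, host, download_time):
            if self.back[host]:
                # Re-schedule the host after politeness * download_time (prefers fast servers).
                heapq.heappush(self.heap, (time.time() + self.politeness * download_time, host))
            else:
                del self.back[host]
                self._refill()

A driver loop would call host, url = frontier.dequeue(), download the URL, and then call frontier.finished(host, elapsed) so that the host is rescheduled based on the measured download time.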

5 Special applications

Continuous crawling
Changes in the Web:
- changing content of already crawled pages
- appearance / disappearance of pages
- appearance / disappearance of links between pages
- users of the IR system change their interests

Content changes
Individual pages change their content often: more than 40% change at least daily [CG00]. But there is no overall pattern; change occurs at different frequencies and time scales (seconds to years).
Figure: change rate of Web pages.
The change frequency can be modelled as a Poisson process. With X(t) the number of changes in (0, t] and λ the change rate, for k = 0, 1, ...:
Pr{X(s + t) − X(s) = k} = (λt)^k e^(−λt) / k!
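
For crawl scheduling, a useful consequence of this model is the probability that a page has changed at all since the last visit; a small sketch (the function name is mine):

    import math

    def prob_changed(change_rate, elapsed):
        """Probability of at least one change within `elapsed` time units,
        for a page with Poisson change rate `change_rate` (= lambda)."""
        return 1.0 - math.exp(-change_rate * elapsed)

    # e.g. a page that changes on average once per day, revisited after half a day:
    print(prob_changed(1.0, 0.5))   # ~0.39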

Change types [OP08]
The temporal behavior of a page (or page region) can be classified as:
- static: no changes
- churn: new content supplants old content, e.g. the quote of the day
- scroll: new content is appended to old content, e.g. blog entries

The Web changes [NCO04; Das+07]
- New pages are created at a rate of 8% per week
- During one year, 80% of pages disappear
- New links are created at a rate of 25% per week, significantly faster than the rate of new page creation
- Links are retired at about the same pace as pages

Users change
- User interests change
- Goals of the IR system maintainer change
Both require adaptation of the crawl strategy.

Re-Crawling Strategies
Crawlers need to balance different considerations:
- Coverage: fetch new pages
- Freshness: find updates of existing pages
Figure: crawl ordering model [ON10].

Basic strategies
Batch crawling:
- the crawl process is stopped and restarted periodically
- each document is crawled only once per crawl
Incremental crawling:
- crawling runs continuously
- a document can be crawled multiple times during a crawl
- the crawl frequency can differ between sites
Batch crawling is easier to implement, incremental crawling is more powerful.

Batch Crawling Strategies
The goal is to maximize the Weighted Coverage
WC(t) = Σ_{p ∈ C(t)} w(p)
with
- t: time since the start of the crawl
- C(t): pages crawled until time t
- w(p): weight of page p, 0 ≤ w(p) ≤ 1
Main strategy types (ordered by complexity): breadth-first search, order by in-degree, order by PageRank.
Figure: weighted coverage as a function of time t.

Incremental Crawling Strategies
The goal is to maximize the Weighted Freshness
WF(t) = Σ_{p ∈ C(t)} w(p) · f(p, t)
with f(p, t) the freshness level of page p at time t. Its steady-state average is
WF = lim_{t→∞} (1/t) ∫_0^t WF(τ) dτ
Trade-off between coverage and freshness: often treated as a business decision; needs to be tuned towards the goals of the specific application.

Maximizing Freshness [CG03]
- Model estimation: create a temporal model for each page p
- Resource allocation: given a maximum crawl rate r, decide on a revisitation frequency r(p) for each page
- Scheduling: produce a crawl order that implements the targeted revisitation frequencies as closely as possible

Model estimation
Create a model of the temporal behavior of p, given samples of the past content of p or of pages similar to p. Samples are often not evenly spaced.
The content can give hints about the change frequency: HTTP headers, number of links, depth of the page within the site.
Similar pages have similar behavior: same site, similar content, similar link structure.

Resource allocation
Binary freshness model:
f(p, t) = 1 if the old (cached) copy equals the live copy, 0 otherwise
An intuitively good strategy is proportional resource allocation: assign a revisitation frequency proportional to the change frequency.
But uniform resource allocation achieves a better average binary freshness (assuming equal page weights). Reason:
- pages with a high change frequency (A) are stale very often regardless of crawl frequency
- pages with a lower change frequency (B) can be kept fresh more easily
- it is better to keep several pages of type B fresh than to waste resources on A
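
Under the Poisson change model above, a page with change rate λ that is revisited at evenly spaced intervals of length I has a time-averaged binary freshness of (1 − e^(−λI)) / (λI). The sketch below uses this closed form to compare uniform and proportional allocation for two pages under a shared revisit budget; the numbers are illustrative, not from the slides:

    import math

    def avg_freshness(change_rate, interval):
        """Time-averaged binary freshness of a page with Poisson change rate
        `change_rate` that is revisited every `interval` time units."""
        x = change_rate * interval
        return (1.0 - math.exp(-x)) / x

    rates = [9.0, 1.0]    # page A changes 9 times per day, page B once per day
    # Budget: 2 revisits per day in total.

    # Uniform allocation: one revisit per day for each page.
    uniform = sum(avg_freshness(r, 1.0) for r in rates) / len(rates)

    # Proportional allocation: revisits split 9:1, i.e. 1.8/day and 0.2/day.
    proportional = sum(avg_freshness(r, 1.0 / f)
                       for r, f in zip(rates, [1.8, 0.2])) / len(rates)

    print(round(uniform, 2), round(proportional, 2))   # ~0.37 vs ~0.20: uniform wins

Shifting visits from the fast-changing page A to the slow-changing page B raises the average freshness, matching the argument above.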

Resource allocation (continued)
Continuous freshness model (age):
age(p, t) = 0 if the old copy equals the live copy, otherwise a, where a is the time for which the cached and the live copy have differed
Under this model the revisitation frequency increases with the change frequency: age increases monotonically, so the crawler cannot give up on a page.
Instead of age, the crawler can also consider content changes directly, e.g. distinguish between long-lived and ephemeral content.

Scheduling
Goal: produce a crawl ordering that implements the targeted revisitation frequencies as closely as possible. Uniform spacing of the downloads of p achieves the best results.

Focused Crawling
Crawling the whole Web is very expensive, but often only documents about specific topics are relevant.
Basic assumption: link homogeneity, i.e. pages preferentially link to similar pages, so links from relevant pages are typically more relevant.
Example applications:
- vertical search engines (e.g. hotels, jobs)
- data mining
- Web archives
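
A focused crawler typically replaces the FIFO frontier with a priority queue ordered by an estimated relevance of each URL, e.g. a topic classifier applied to the linking page. A minimal sketch; score_url is a placeholder for such a classifier:

    import heapq

    frontier = []   # priority queue of (negated relevance score, URL)

    def score_url(url, source_page):
        # Placeholder: estimate topical relevance from the linking page,
        # the anchor text, the URL string, ...
        return 0.5

    def add_url(url, source_page):
        heapq.heappush(frontier, (-score_url(url, source_page), url))

    def next_url():
        # Always fetch the most promising URL first.
        return heapq.heappop(frontier)[1] if frontier else None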

Deep Web Crawling / JavaScript crawling
So far we only crawl by following links. But not all content is accessible through links:
- Web forms
- JavaScript applications

Deep Web Crawling
Deep Web crawling: access content that is not reachable via links, but only by filling in HTML forms.
Approach:
1. Locate deep Web content sources, e.g. by focused crawling.
2. Select relevant sources, e.g. by expected coverage or by reputation.
3. Extract the underlying content.
Content extraction:
1. Select the relevant form fields (e.g. exclude sort order and other presentational fields).
2. Detect the role of the targeted fields (data type, appropriate values).
3. Create a database of values, e.g. manually or from Web sources.
4. Issue queries, extract the content, and extend the values database.
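
Step 4 amounts to submitting the form programmatically for each candidate value. A minimal sketch using only the standard library; the form URL, field name, and value list are hypothetical:

    from urllib.parse import urlencode
    from urllib.request import urlopen

    values_database = ["Berlin", "Hamburg", "Munich"]   # hypothetical value database

    for value in values_database:
        query = urlencode({"city": value})              # hypothetical form field
        with urlopen("http://www.example.com/search?" + query) as response:
            page = response.read()
        # ... extract content from `page` and extend values_database with new values ...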

JavaScript Crawling
Originally, Web pages were created on the server and did not change once delivered to the client. JavaScript allows pages to:
- change their appearance
- add interactive features
- load and replace content
Crawlers typically do not execute JavaScript and can therefore miss some content.

JavaScript crawling
Crawling can be modeled by states (distinct pages) and transitions (ways to move from one state to another):
- Web crawling: state = URL, transition = follow link
- JavaScript crawling: state = DOM representation, transition = potentially any user interaction

JavaScript Crawling strategy
1. load the page
2. for each potential action (click, scroll, hover, ...):
   - execute the action
   - wait until JavaScript and possibly AJAX requests have finished
   - compare the DOM to the original state
   - if the DOM has changed, update the state model and continue recursively
   - reset to the original state
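
A sketch of this exploration loop; the helpers (possible_actions, execute, wait_for_javascript, current_dom, reset_to) and the state-model object stand in for a headless-browser API and are hypothetical, not from the slides:

    def explore(state, model, depth=0, max_depth=3):
        """Recursively explore states reachable from `state` by user actions."""
        if depth >= max_depth:                    # stopping criterion for the exploration
            return
        for action in possible_actions(state):    # click, scroll, hover, ...
            execute(action)
            wait_for_javascript()                 # let scripts and AJAX requests finish
            new_dom = current_dom()
            if new_dom != state.dom and not model.has_state(new_dom):
                new_state = model.add_state(new_dom, via=action)
                explore(new_state, model, depth + 1, max_depth)
            reset_to(state)                       # restore the original state before the next action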

Challenges in JavaScript Crawling
- higher computational cost
- security
- asynchronous model: JavaScript can execute actions in the background and may change the document in unexpected ways
- detection of relevant changes (ignore e.g. changed advertisements)
- stopping criteria for the state model exploration

What have we discussed today?
- Prerequisites for Web crawling
- General model of a crawler
- Implementation considerations
- Typical challenges
- Special applications

Further reading
- Christopher Olston and Marc Najork. "Web Crawling". In: Foundations and Trends in Information Retrieval 4.3 (2010), pp. 175-246. DOI: 10.1561/1500000017. (General introduction)
- Suryakant Choudhary et al. "Crawling Rich Internet Applications: The State of the Art". In: CASCON '12. 2012, pp. 146-160. (JavaScript crawling)
- Carlos Castillo and Ricardo Baeza-Yates. "Practical Web Crawling". Technical Report. University of Chile, 2005. URL: http://chato.cl/papers/castillo_05_practical_web_crawling.pdf (Detailed implementation advice)

References I
- Carlos Castillo and Ricardo Baeza-Yates. "Practical Web Crawling". Technical Report. University of Chile, 2005. URL: http://chato.cl/papers/castillo_05_practical_web_crawling.pdf
- [CG00] Junghoo Cho and Hector Garcia-Molina. "The Evolution of the Web and Implications for an Incremental Crawler". In: VLDB 2000. 2000, pp. 200-209.
- [CG03] Junghoo Cho and Hector Garcia-Molina. "Effective Page Refresh Policies for Web Crawlers". In: ACM Transactions on Database Systems 28.4 (2003).
- Suryakant Choudhary et al. "Crawling Rich Internet Applications: The State of the Art". In: CASCON '12. 2012, pp. 146-160.

References II
- [Das+07] Anirban Dasgupta et al. "The Discoverability of the Web". In: WWW '07. 2007. DOI: 10.1145/1242572.1242630.
- [NCO04] Alexandros Ntoulas, Junghoo Cho, and Christopher Olston. "What's New on the Web? The Evolution of the Web from a Search Engine Perspective". In: WWW '04. 2004. DOI: 10.1145/988672.988674.
- [ON10] Christopher Olston and Marc Najork. "Web Crawling". In: Foundations and Trends in Information Retrieval 4.3 (2010), pp. 175-246. DOI: 10.1561/1500000017.
- [OP08] Christopher Olston and Sandeep Pandey. "Recrawl Scheduling Based on Information Longevity". In: WWW '08. ACM, 2008, pp. 437-446. DOI: 10.1145/1367497.1367557.