Comparison of Web Archives created by HTTrack and Heritrix (H3) and the conversion of HTTrack Web Archives to the Web ARChive File Format (WARC)

Comparison of Web Archives created by HTTrack and Heritrix (H3) and the conversion of HTTrack Web Archives to the Web ARChive File Format (WARC)
Barbara Löhle, Bibliotheksservice-Zentrum Baden-Württemberg
Freiburg, 15 November 2012

Overview
- Introduction of Web Archives using SWBcontent
- Laboratory-scale downloads
- HTTrack Web Archives
- Heritrix 3.x (H3) using ARC and WARC formats
- Java program: httrack2arc
- Concept: conversion of HTTrack archives to WARC using H3

SWBcontent
The web application SWBcontent is the technical base of different installations, e.g. the Baden-Württemberg Online Archive (BOA), with the main functionality of web harvesting in the context of libraries and archives. It performs the following tasks: collecting, deducing, presenting, and preserving web pages and online published documents.
SWBcontent integrates two web crawlers:
- HTTrack
- Heritrix 3.x
(Presentation of ARCs and WARCs: Wayback Machine)

Installations of SWBcontent

Integration of Heritrix 3.x
[Architecture diagram: SWBcontent performs web harvesting over HTTP, communicates with H3 via HTTPS (Jetty web server), and presents results via Wayback over HTTP.]

Presentation of an HTTrack result

Presentation of a Heritrix 3.x result

HTTrack Web Archives
- Description of HTTrack
- hts-cache directory: collecting available data
- Example 1: single PDF file download
- Example 2: HTML download

HTTrack 3.46-x
HTTrack is a popular open source web crawler written in C. The HTTrack web archive is based on the file-and-folder strategy: each harvested URL is stored in a separate file, and the path and filename are created from the original URL (compare Christensen, 2004). HTTrack crawls create a large number of small files in the filesystem, which are difficult for the operating system to handle. Downloaded HTML pages are always modified to fit the local filesystem structure (compare HTTrack).
A crawl is configured by command line parameters, e.g.:
- obey robots.txt: -s2
- ignore robots.txt: -s0
- user agent: -F "Mozilla 1.0 ..."
A hypothetical invocation is sketched below.
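A minimal sketch of such an invocation; the seed URL and output directory are placeholders, -O (output path) is a standard HTTrack option, and -s0 and -F are the flags named above:

    httrack "http://www.example.com/" -O /archives/example -s0 -F "Mozilla/1.0 (compatible; SWBcontent)"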

HTTrack 3.46-x
A crawl can be restarted on the basis of the previous crawl. For this it is necessary that the hts-cache is created and updated. The hts-cache contains, among others, a zip file with the unmodified HTML pages.
When considering the conversion of HTTrack-created data to ARC or WARC, it is necessary to preserve the unmodified HTML pages and as much as possible of the response of the requested web server. Therefore a detailed analysis of the content of the hts-cache directory is necessary.

hts-cache
- hts-log.txt: contains the error and warning messages of HTTrack during a crawl and the crawl statistics (duration of the crawl, number of links scanned, number of files written, average bandwidth).
- hts-cache/new.txt: presents, on a per-URL basis, the metadata log of each downloaded or requested URL and the path information of the downloaded file.
- hts-cache/new.zip: includes the original download structure and preserves, in the case of text files, the original files. The extra field of the local file header of each zip file entry contains additional data similar to HTTP header fields.
- hts-cache/doit.log: contains the used HTTrack command line and the start time of the crawl.

hts-cache/new.zip
The usage of the extra field of the local file header of a new.zip entry is described by the function cache_add(...) of src/htscache.c (HTTrack source code). A sketch of reading this field from Java follows.
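A minimal Java sketch for inspecting these extra fields, assuming one simply wants to dump them: java.util.zip.ZipInputStream parses the local file headers, so ZipEntry.getExtra() returns the raw bytes HTTrack stored there (the path and character set are placeholder assumptions):

    import java.io.FileInputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class HtsCacheExtraFieldDump {
        public static void main(String[] args) throws Exception {
            // Placeholder path to an HTTrack cache archive.
            try (ZipInputStream zip = new ZipInputStream(
                    new FileInputStream("hts-cache/new.zip"))) {
                ZipEntry entry;
                while ((entry = zip.getNextEntry()) != null) {
                    // getExtra() returns the local-file-header extra field
                    // that HTTrack's cache_add() filled with HTTP-header-like data.
                    byte[] extra = entry.getExtra();
                    System.out.println(entry.getName());
                    if (extra != null) {
                        System.out.println(new String(extra, StandardCharsets.ISO_8859_1));
                    }
                    zip.closeEntry();
                }
            }
        }
    }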

Example 1: single PDF download

Example 1: single PDF download
A description of the data listed for any downloaded URL can be found in the function back_finalize(...) of src/htsback.c (HTTrack source code). Simple case of a successfully downloaded PDF file:

Example 2: HTML download
Short excerpt of the generated new.txt file. The case of the robots.txt that was not found is interesting.

Example 2: HTML download
[Screenshot: the robots.txt error message of the web server, stored in the extra field of the local file header.]

H3 - ARC and WARC Formats
- Heritrix
- Web archive formats: ARC - WARC
- ARC
- WARC
- WARC example: single PDF file
- Comparison ARC - WARC ('response' only): single PDF file

Heritrix
Heritrix, the open source web crawler of the Internet Archive, is written in Java and has been under development since 2004. The Heritrix package provides a web application, the Web Administrative Console, hosted by the embedded Jetty Java HTTP server.
Heritrix is available in 2 major releases:
- Heritrix 1.14.x (current build: heritrix-1.15.5), mainly maintenance changes.
- Heritrix 3.1.x (H3; current build: heritrix-3.1.2-SNAPSHOT)
Main differences between Heritrix 3.1.x and Heritrix 1.14.x:
- H3 uses the application development framework Spring 3.x; the complex configuration of H3 is realized by Spring beans.
- H3 is RESTful: it uses Representational State Transfer (REST) to support HTTPS-based client communication.

Web archive formats
File + folder: each harvested file is stored as a separate file, e.g. HTTrack. The problem is that this directory structure is not self-contained: important information, e.g. operator contact, organization, robots policy, is not preserved together with the data.
Self-contained: structured text with embedded payload data (ARC and WARC). The strategy consists in aggregating the large number of downloaded files (one file per URL) into a small number of text files. Such a text file contains a sequence of document records. Metadata describing the crawl are placed at the beginning of the file.

ARC - WARC
The ARC archival storage format was developed and used by the Internet Archive to store data (version 1.0: 1996; currently used version: 1.1).
The Web ARChive (WARC) archival storage format, designed for long-term storage of web crawls, is an extension of the ARC format. WARC became an ISO standard, ISO 28500:2009, in May 2009.
The ARC and the WARC format are non-XML file formats.
Heritrix 1.14.x and H3 create ARC and WARC files:
- Heritrix 1.14.x writes ARC files by default.
- H3 writes WARC files by default.

ARC
The ARC file format is a text format embedding data. Using the ARC format, the creation of a small number of large files (up to 1 GByte per file) is possible. The embedded data are organized as document records. These records start with a header line containing the URI, followed first by metadata of the requested URI (often the HTTP header) and then by the downloaded data. A schematic record is sketched below.
The extra metadata of the web crawl are written in XML and form the first record of the ARC file. These extra metadata contain, e.g.:
- used crawler software
- IP and hostname of the host creating the ARC file
- contact of the crawl operator
- handling of the robots.txt (compare HTTrack)
- user agent (compare HTTrack)
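A schematic sketch of a single ARC URL record; the URL, IP, timestamp, and lengths are placeholders, and the header-line fields are URL, IP address, archive date, content type, and record length:

    http://www.example.com/ 192.0.2.20 20121115093100 text/html 2248
    HTTP/1.1 200 OK
    Content-Type: text/html
    Content-Length: 2130

    <html> ... </html>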

WARC
The WARC file format is designed to store the payload content as well as the control information of important application layer protocols; therefore the characteristic request-response type of communication is recorded. A WARC file consists of a sequence of WARC records. Essential are the 8 different record types:
- 'warcinfo': usually at the beginning of a WARC file; describes the WARC records that follow. 'warcinfo' contains optional fields, e.g. operator, software, robots, ... (equivalent to the metadata in ARC and the metadata bean of the crawler-beans.cxml) and all DCMI (Dublin Core Metadata Initiative) fields.
- 'response': includes the usual response of a requested server, e.g. the HTTP response of a web server.
- 'resource': contains a 'response' without full protocol response information.

WARC
- 'request': holds, as with 'response', the complete scheme-specific request (including network protocol communications).
- 'metadata': additional content in the context of harvested resources.
- 'revisit': in the context of revisitation of already archived content.
- 'conversion': contains the alternate content of another record's content.
- 'continuation': for formal reasons; for the case of multi-part WARC files.
Regarding the data available from HTTrack, only the record types marked in red (on the original slide) are relevant. A schematic 'response' record is sketched below.
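A schematic sketch of a 'response' record; the URI, date, UUID, and length are placeholders (WARC/1.0 corresponds to ISO 28500:2009):

    WARC/1.0
    WARC-Type: response
    WARC-Target-URI: http://www.example.com/
    WARC-Date: 2012-11-15T09:31:00Z
    WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
    Content-Type: application/http; msgtype=response
    Content-Length: 2248

    HTTP/1.1 200 OK
    Content-Type: text/html
    ...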

H3 Download WARC
Common 'warcinfo', using the WARC filename template:
${prefix}-${timestamp17}-${serialno}-${heritrix.pid}~${heritrix.hostname}~${heritrix.port}

H3 Download WARC
Common 'response': DNS IP lookup

H3 Download WARC
Common 'response', 'request', and 'metadata' records for robots.txt (shown on three example slides).

Concept: HTTrack to WARC
HTTrack data contains only the response of the requested web server; therefore only the WARC record type 'response' can be created. The WARC record type 'resource' offers the interesting possibility to convert HTTrack single PDF downloads without a created hts-cache to the WARC format.
The HTTrack data does not contain the IP of the requested web server.
The 'metadata' bean of the H3 crawler-beans.cxml is the base of the 'warcinfo' optional fields (or the ARC metadata). It is essential to collect the data of the bean from the different sources: HTTrack and SWBcontent.

Crawler-beans.cxml - metadata
HTTrack parameters map to: robotsPolicyName, userAgentTemplate. The rest of the properties can be taken directly from SWBcontent or should be created depending on the harvesting institution. A sketch of such a bean follows.
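A hypothetical sketch of such a metadata bean, assuming H3's default CrawlMetadata class and placeholder values; robotsPolicyName and userAgentTemplate carry the HTTrack-derived settings, the remaining values would come from SWBcontent or the harvesting institution:

    <bean id="metadata" class="org.archive.modules.CrawlMetadata" autowire="byName">
      <property name="operatorContactUrl" value="http://www.example.org/contact"/>
      <property name="jobName" value="boa-example-crawl"/>
      <property name="description" value="Conversion of an HTTrack crawl"/>
      <property name="robotsPolicyName" value="ignore"/>
      <property name="userAgentTemplate"
                value="Mozilla/1.0 (compatible; +@OPERATOR_CONTACT_URL@)"/>
    </bean>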

H3 WARC - H3 ARC
The crawler-beans.cxml bean 'warcWriter' can be configured in such a way that the record types 'request' and 'metadata' are not written. In this case the ARC and WARC formats are nearly equal.
One should take into account that H3 offers the standalone converter classes Arc2Warc and Warc2Arc. Further, there exist the ArcUtils and WarcUtils classes with methods to check whether files of the given format are correctly written. A hypothetical invocation is sketched below.
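A hypothetical invocation of the converter, assuming the class lives at org.archive.io.Arc2Warc (the exact package, arguments, and jar name should be verified against the release in use, and the jar's dependencies must be on the classpath):

    java -cp heritrix-commons-3.2.0.jar org.archive.io.Arc2Warc input.arc output.warc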

Java program: httrack2arc
There exists the project http://code.google.com/p/httrack2arc/.
- The heritrix-1.14.4.jar is used.
- The extra field of the local file header of each zip file entry is not evaluated.
- Because the HTTrack data contains no hint of the IP, the fixed IP 1.1.1.1 is used.
- Self-contained metadata are not taken into account.
If one wants to use this program, one should create ArcWriter examples with Heritrix 1.14.4 or H3 to compare the results.

httrack2arc - example

Concept: HTTrack to WARC
- Usage of heritrix-commons-3.2.x.jar, especially the WarcWriter class.
- Evaluation of the HTTrack hts-cache directory.
- Modelling of the used metadata bean of the crawler-beans.cxml.
- Using data of the SWBcontent database.
A minimal hand-rolled sketch of writing a 'response' record follows.
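For illustration only, a minimal Java sketch that writes one WARC 'response' record by hand, i.e. without the Heritrix WarcWriter API; all paths, the URI, the date, and the faked HTTP header are placeholder assumptions (a real converter would reconstruct the HTTP response from the extra-field data in new.zip):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.UUID;

    public class HttrackToWarcSketch {

        // Writes a single WARC/1.0 'response' record to the given stream.
        static void writeResponseRecord(OutputStream out, String targetUri,
                                        String warcDate, byte[] httpResponse)
                throws IOException {
            String header = "WARC/1.0\r\n"
                    + "WARC-Type: response\r\n"
                    + "WARC-Target-URI: " + targetUri + "\r\n"
                    + "WARC-Date: " + warcDate + "\r\n"
                    + "WARC-Record-ID: <urn:uuid:" + UUID.randomUUID() + ">\r\n"
                    + "Content-Type: application/http; msgtype=response\r\n"
                    + "Content-Length: " + httpResponse.length + "\r\n"
                    + "\r\n";
            out.write(header.getBytes(StandardCharsets.US_ASCII));
            out.write(httpResponse);  // HTTP status line, headers, and body
            out.write("\r\n\r\n".getBytes(StandardCharsets.US_ASCII));  // record separator
        }

        public static void main(String[] args) throws IOException {
            // Placeholder: a file unpacked from hts-cache/new.zip.
            byte[] payload = Files.readAllBytes(Paths.get("unzipped/index.html"));
            // Placeholder: a minimal faked HTTP response header; a real converter
            // would rebuild it from the zip entry's extra field.
            byte[] head = ("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n"
                    + "Content-Length: " + payload.length + "\r\n\r\n")
                    .getBytes(StandardCharsets.US_ASCII);
            byte[] httpResponse = new byte[head.length + payload.length];
            System.arraycopy(head, 0, httpResponse, 0, head.length);
            System.arraycopy(payload, 0, httpResponse, head.length, payload.length);

            try (OutputStream out = Files.newOutputStream(Paths.get("out.warc"))) {
                writeResponseRecord(out, "http://www.example.com/index.html",
                        "2012-11-15T09:31:00Z", httpResponse);
            }
        }
    }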

End
Thank you for your attention. Are there any questions or comments?