Comparison of Web Archives created by HTTrack and Heritrix (H3) and the conversion of HTTrack Web Archives to the Web ARChive File Format (WARC)

Comparison of Web Archives created by HTTrack and Heritrix (H3) and the conversion of HTTrack Web Archives to the Web ARChive File Format (WARC)
Barbara Löhle, Bibliotheksservice-Zentrum Baden-Württemberg
Freiburg, 15 November 2012

Overview
- Introduction of Web Archives using SWBcontent
- Laboratory-scale downloads
- HTTrack Web Archives
- Heritrix 3.x (H3) using ARC and WARC formats
- Java program: httrack2arc
- Concept: conversion of HTTrack archives to WARC using H3

SWBcontent
The web application SWBcontent is the technical base of different installations, e.g. the Baden-Württemberg Online Archive (BOA), with the main functionality of web harvesting in the context of libraries and archives. It performs the following tasks: collecting, deducing, presenting, and preserving web pages and online published documents.
SWBcontent integrates two web crawlers:
- HTTrack
- Heritrix 3.x
(Presentation of ARCs and WARCs: Wayback Machine)

Installations of SWBcontent

Integration of Heritrix 3.x
[Architecture diagram: SWBcontent performs web harvesting over HTTP, communicates with H3 via HTTPS (Jetty web server), and presents results via Wayback over HTTP.]

Presentation of an HTTrack result

Presentation of a Heritrix 3.x result

HTTrack Web Archives
- Description of HTTrack
- hts-cache directory: collecting available data
- Example 1: single PDF file download
- Example 2: HTML download

HTTrack 3.46-x
HTTrack is a popular open source web crawler written in C. The HTTrack web archive is based on the file-and-folder strategy: each harvested URL is stored in a separate file, and the path and filename are created from the original URL (compare Christensen, 2004). HTTrack crawls create a large number of small files in the filesystem, which are difficult for the operating system to handle. Downloaded HTML pages are always modified to fit the local filesystem structure (compare HTTrack).
A crawl is configured by command line parameters, e.g.:
- obey robots.txt: -s2
- ignore robots.txt: -s0
- user agent: -F "Mozilla 1.0 ..."
A hypothetical invocation is sketched below.
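A minimal sketch of such an invocation; the seed URL and output directory are placeholders, -O (output path) is a standard HTTrack option, and -s0 and -F are the flags named above:

    httrack "http://www.example.com/" -O /archives/example -s0 -F "Mozilla/1.0 (compatible; SWBcontent)"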

HTTrack 3.46-x
A crawl can be restarted on the basis of the previous crawl. For this it is necessary that the hts-cache is created and updated. The hts-cache contains, among others, a zip file with the unmodified HTML pages.
When considering the conversion of HTTrack-created data to ARC or WARC, it is necessary to preserve the unmodified HTML pages and as much as possible of the response of the requested web server. Therefore a detailed analysis of the content of the hts-cache directory is necessary.

hts-cache
- hts-log.txt: contains the error and warning messages of HTTrack during a crawl and the crawl statistics (duration of the crawl, number of links scanned, number of files written, average bandwidth).
- hts-cache/new.txt: presents, on a per-URL basis, the metadata log of each downloaded or requested URL and the path information of the downloaded file.
- hts-cache/new.zip: includes the original download structure and preserves, in the case of text files, the original files. The extra field of the local file header of each zip file entry contains additional data similar to HTTP header fields.
- hts-cache/doit.log: contains the used HTTrack command line and the start time of the crawl.

hts-cache/new.zip
The usage of the extra field of the local file header of a new.zip entry is described by the function cache_add(...) of src/htscache.c (HTTrack source code). A sketch of reading this field from Java follows.
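A minimal Java sketch for inspecting these extra fields, assuming one simply wants to dump them: java.util.zip.ZipInputStream parses the local file headers, so ZipEntry.getExtra() returns the raw bytes HTTrack stored there (the path and character set are placeholder assumptions):

    import java.io.FileInputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class HtsCacheExtraFieldDump {
        public static void main(String[] args) throws Exception {
            // Placeholder path to an HTTrack cache archive.
            try (ZipInputStream zip = new ZipInputStream(
                    new FileInputStream("hts-cache/new.zip"))) {
                ZipEntry entry;
                while ((entry = zip.getNextEntry()) != null) {
                    // getExtra() returns the local-file-header extra field
                    // that HTTrack's cache_add() filled with HTTP-header-like data.
                    byte[] extra = entry.getExtra();
                    System.out.println(entry.getName());
                    if (extra != null) {
                        System.out.println(new String(extra, StandardCharsets.ISO_8859_1));
                    }
                    zip.closeEntry();
                }
            }
        }
    }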

Example 1: single PDF download

Example 1: single PDF download
A description of the data listed for any downloaded URL can be found in the function back_finalize(...) of src/htsback.c (HTTrack source code). Simple case of a successfully downloaded PDF file:

Example 2: HTML download
Short excerpt of the generated new.txt file. The case of the robots.txt that was not found is interesting.

Example 2: HTML download
[Screenshot: the robots.txt error message of the web server, stored in the extra field of the local file header.]

H3 - ARC and WARC Formats
- Heritrix
- Web archive formats: ARC - WARC
- ARC
- WARC
- WARC example: single PDF file
- Comparison ARC - WARC ('response' only): single PDF file

Heritrix
Heritrix, the open source web crawler of the Internet Archive, is written in Java and has been under development since 2004. The Heritrix package provides a web application, the Web Administrative Console, hosted by the embedded Jetty Java HTTP server.
Heritrix is available in 2 major releases:
- Heritrix 1.14.x (current build: heritrix-1.15.5), mainly maintenance changes.
- Heritrix 3.1.x (H3; current build: heritrix-3.1.2-SNAPSHOT)
Main differences between Heritrix 3.1.x and Heritrix 1.14.x:
- H3 uses the application development framework Spring 3.x; the complex configuration of H3 is realized by Spring beans.
- H3 is RESTful: it uses Representational State Transfer (REST) to support HTTPS-based client communication.

Web archive formats
File + folder: each harvested file is stored as a separate file, e.g. HTTrack. The problem is that this directory structure is not self-contained: important information, e.g. operator contact, organization, robots policy, is not preserved together with the data.
Self-contained: structured text with embedded payload data (ARC and WARC). The strategy consists in aggregating the large number of downloaded files (one file per URL) into a small number of text files. Such a text file contains a sequence of document records. Metadata describing the crawl are placed at the beginning of the file.

ARC - WARC
The ARC archival storage format was developed and used by the Internet Archive to store data (version 1.0: 1996; currently used version: 1.1).
The Web ARChive (WARC) archival storage format, designed for long-term storage of web crawls, is an extension of the ARC format. WARC became an ISO standard, ISO 28500:2009, in May 2009.
The ARC and the WARC format are non-XML file formats.
Heritrix 1.14.x and H3 create ARC and WARC files:
- Heritrix 1.14.x writes ARC files by default.
- H3 writes WARC files by default.

ARC
The ARC file format is a text format embedding data. Using the ARC format, the creation of a small number of large files (up to 1 GByte per file) is possible. The embedded data are organized as document records. These records start with a header line containing the URI, followed first by metadata of the requested URI (often the HTTP header) and then by the downloaded data. A schematic record is sketched below.
The extra metadata of the web crawl are written in XML and form the first record of the ARC file. These extra metadata contain, e.g.:
- used crawler software
- IP and hostname of the host creating the ARC file
- contact of the crawl operator
- handling of the robots.txt (compare HTTrack)
- user agent (compare HTTrack)
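A schematic sketch of a single ARC URL record; the URL, IP, timestamp, and lengths are placeholders, and the header-line fields are URL, IP address, archive date, content type, and record length:

    http://www.example.com/ 192.0.2.20 20121115093100 text/html 2248
    HTTP/1.1 200 OK
    Content-Type: text/html
    Content-Length: 2130

    <html> ... </html>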

WARC
The WARC file format is designed to store the payload content as well as the control information of important application layer protocols; therefore the characteristic request-response type of communication is recorded. A WARC file consists of a sequence of WARC records. Essential are the 8 different record types:
- 'warcinfo': usually at the beginning of a WARC file; describes the WARC records that follow. 'warcinfo' contains optional fields, e.g. operator, software, robots, ... (equivalent to the metadata in ARC and the metadata bean of the crawler-beans.cxml) and all DCMI (Dublin Core Metadata Initiative) fields.
- 'response': includes the usual response of a requested server, e.g. the HTTP response of a web server.
- 'resource': contains a 'response' without full protocol response information.

WARC
- 'request': holds, as with 'response', the complete scheme-specific request (including network protocol communications).
- 'metadata': additional content in the context of harvested resources.
- 'revisit': in the context of revisitation of already archived content.
- 'conversion': contains the alternate content of another record's content.
- 'continuation': for formal reasons; for the case of multi-part WARC files.
Regarding the data available from HTTrack, only the record types marked in red (on the original slide) are relevant. A schematic 'response' record is sketched below.
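A schematic sketch of a 'response' record; the URI, date, UUID, and length are placeholders (WARC/1.0 corresponds to ISO 28500:2009):

    WARC/1.0
    WARC-Type: response
    WARC-Target-URI: http://www.example.com/
    WARC-Date: 2012-11-15T09:31:00Z
    WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
    Content-Type: application/http; msgtype=response
    Content-Length: 2248

    HTTP/1.1 200 OK
    Content-Type: text/html
    ...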

H3 Download WARC
Common 'warcinfo', using the WARC filename template:
${prefix}-${timestamp17}-${serialno}-${heritrix.pid}~${heritrix.hostname}~${heritrix.port}

H3 Download WARC
Common 'response': DNS IP lookup

H3 Download WARC
Common 'response', 'request', and 'metadata' records for robots.txt (shown on three example slides).

Concept: HTTrack to WARC
HTTrack data contains only the response of the requested web server; therefore only the WARC record type 'response' can be created. The WARC record type 'resource' offers the interesting possibility to convert HTTrack single PDF downloads without a created hts-cache to the WARC format.
The HTTrack data does not contain the IP of the requested web server.
The 'metadata' bean of the H3 crawler-beans.cxml is the base of the 'warcinfo' optional fields (or the ARC metadata). It is essential to collect the data of the bean from the different sources: HTTrack and SWBcontent.

Crawler-beans.cxml - metadata
HTTrack parameters map to: robotsPolicyName, userAgentTemplate. The rest of the properties can be taken directly from SWBcontent or should be created depending on the harvesting institution. A sketch of such a bean follows.
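A hypothetical sketch of such a metadata bean, assuming H3's default CrawlMetadata class and placeholder values; robotsPolicyName and userAgentTemplate carry the HTTrack-derived settings, the remaining values would come from SWBcontent or the harvesting institution:

    <bean id="metadata" class="org.archive.modules.CrawlMetadata" autowire="byName">
      <property name="operatorContactUrl" value="http://www.example.org/contact"/>
      <property name="jobName" value="boa-example-crawl"/>
      <property name="description" value="Conversion of an HTTrack crawl"/>
      <property name="robotsPolicyName" value="ignore"/>
      <property name="userAgentTemplate"
                value="Mozilla/1.0 (compatible; +@OPERATOR_CONTACT_URL@)"/>
    </bean>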

H3 WARC - H3 ARC
The crawler-beans.cxml bean 'warcWriter' can be configured in such a way that the record types 'request' and 'metadata' are not written. In this case the ARC and WARC formats are nearly equal.
One should take into account that H3 offers the standalone converter classes Arc2Warc and Warc2Arc. Further, there exist the ArcUtils and WarcUtils classes with methods to check whether files of the given format are correctly written. A hypothetical invocation is sketched below.
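A hypothetical invocation of the converter, assuming the class lives at org.archive.io.Arc2Warc (the exact package, arguments, and jar name should be verified against the release in use, and the jar's dependencies must be on the classpath):

    java -cp heritrix-commons-3.2.0.jar org.archive.io.Arc2Warc input.arc output.warc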

Java program: httrack2arc
There exists the project http://code.google.com/p/httrack2arc/.
- The heritrix-1.14.4.jar is used.
- The extra field of the local file header of each zip file entry is not evaluated.
- Because the HTTrack data contains no hint of the IP, the fixed IP 1.1.1.1 is used.
- Self-contained metadata are not taken into account.
If one wants to use this program, one should create ArcWriter examples with Heritrix 1.14.4 or H3 to compare the results.

httrack2arc - example

Concept: HTTrack to WARC
- Usage of heritrix-commons-3.2.x.jar, especially the WarcWriter class.
- Evaluation of the HTTrack hts-cache directory.
- Modelling of the used metadata bean of the crawler-beans.cxml.
- Using data of the SWBcontent database.
A minimal hand-rolled sketch of writing a 'response' record follows.
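For illustration only, a minimal Java sketch that writes one WARC 'response' record by hand, i.e. without the Heritrix WarcWriter API; all paths, the URI, the date, and the faked HTTP header are placeholder assumptions (a real converter would reconstruct the HTTP response from the extra-field data in new.zip):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.UUID;

    public class HttrackToWarcSketch {

        // Writes a single WARC/1.0 'response' record to the given stream.
        static void writeResponseRecord(OutputStream out, String targetUri,
                                        String warcDate, byte[] httpResponse)
                throws IOException {
            String header = "WARC/1.0\r\n"
                    + "WARC-Type: response\r\n"
                    + "WARC-Target-URI: " + targetUri + "\r\n"
                    + "WARC-Date: " + warcDate + "\r\n"
                    + "WARC-Record-ID: <urn:uuid:" + UUID.randomUUID() + ">\r\n"
                    + "Content-Type: application/http; msgtype=response\r\n"
                    + "Content-Length: " + httpResponse.length + "\r\n"
                    + "\r\n";
            out.write(header.getBytes(StandardCharsets.US_ASCII));
            out.write(httpResponse);  // HTTP status line, headers, and body
            out.write("\r\n\r\n".getBytes(StandardCharsets.US_ASCII));  // record separator
        }

        public static void main(String[] args) throws IOException {
            // Placeholder: a file unpacked from hts-cache/new.zip.
            byte[] payload = Files.readAllBytes(Paths.get("unzipped/index.html"));
            // Placeholder: a minimal faked HTTP response header; a real converter
            // would rebuild it from the zip entry's extra field.
            byte[] head = ("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n"
                    + "Content-Length: " + payload.length + "\r\n\r\n")
                    .getBytes(StandardCharsets.US_ASCII);
            byte[] httpResponse = new byte[head.length + payload.length];
            System.arraycopy(head, 0, httpResponse, 0, head.length);
            System.arraycopy(payload, 0, httpResponse, head.length, payload.length);

            try (OutputStream out = Files.newOutputStream(Paths.get("out.warc"))) {
                writeResponseRecord(out, "http://www.example.com/index.html",
                        "2012-11-15T09:31:00Z", httpResponse);
            }
        }
    }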

End
Thank you for your attention. Are there any questions or comments?