Julien Masanès. Web Archiving. With 28 Figures and 6 Tables


Author
Julien Masanès
European Web Archive
25 rue des envierges
75020 Paris, France
julien.masanes@bnf.fr

ACM Computing Classification (1998): H.3, H.4, I.7, K.4
Library of Congress Control Number: 2006930407
ISBN-10 3-540-23338-5 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-23338-1 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media
springer.com
© Springer-Verlag Berlin Heidelberg 2006

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting by the author and SPi
Cover design: KünkelLopka, Heidelberg
Printed on acid-free paper
SPIN: 11307549/45/ 2145/ SPi 5 4 3 2 1 0

Contents

1 Web Archiving: Issues and Methods
  Julien Masanès
  1.1 Introduction
  1.2 Heritage, Society, and the Web
  1.3 Web Characterization in Relation to Preservation
  1.4 New Methods for a New Medium
  1.5 Current Initiatives Overview
  1.6 Conclusion
  References

2 Web Use and Web Studies
  Steve Jones and Camille Johnson
  2.1 Summary
  2.2 Content Analysis
  2.3 Surveys
  2.4 Rhetorical Analysis
  2.5 Discourse Analysis
  2.6 Visual Analysis
  2.7 Ethnography
  2.8 Network Analysis
  2.9 Ethical Considerations
  2.10 Conclusion
  References

3 Selection for Web Archives
  Julien Masanès
  3.1 Introduction
  3.2 Defining a Selection Policy
  3.3 Issues and Concepts
  3.4 Selection Process
  3.5 Documentation
  3.6 Conclusion
  References

4 Copying Websites
  Xavier Roche
  4.1 Introduction: The Art of Copying Websites
  4.2 The Parser
  4.3 Fetching Document
  4.4 Create an Autonomous, Navigable Copy
  4.5 Handling Updates
  4.6 Conclusion
  Reference

5 Archiving the Hidden Web
  Julien Masanès
  5.1 Introduction
  5.2 Finding At Least One Path to Documents
  5.3 Characterizing the Hidden Web
  5.4 Client Side Hidden Web Archiving
  5.5 Crawler-Server Collaboration
  5.6 Archiving Documentary Gateways
  5.7 Conclusion
  References

6 Access and Finding Aids
  Thorsteinn Hallgrimsson
  6.1 Introduction
  6.2 Registration
  6.3 Indexing and Search Engines
  6.4 Access Tools and User Interface
  6.5 Case Studies
  6.6 Acknowledgements
  References

7 Mining Web Collections
  Andreas Aschenbrenner and Andreas Rauber
  7.1 Introduction
  7.2 Material for Web Archives
  7.3 Other Types of Information
  7.4 Use Cases
  7.5 Conclusion
  References

8 The Long-Term Preservation of Web Content
  Michael Day
  8.1 Introduction
  8.2 The Challenge of Long-Term Digital Preservation
  8.3 Developing Trusted Digital Repositories
  8.4 Digital Preservation Strategies
  8.5 Preservation Metadata
  8.6 Digital Preservation and the Web
  8.7 Conclusion
  8.8 Acknowledgements
  References

9 Year-by-Year: From an Archive of the Internet to an Archive on the Internet
  Michele Kimpton and Jeff Ubois
  9.1 Introduction
  9.2 Background: Early Internet Publishing
  9.3 1996: Launch of the Internet Archive
  9.4 1997: Link Structure and Tape Robots
  9.5 1998: Getting Archive Data Onto (Almost) Every Desktop
  9.6 1999: From Tape to Disk, A New Crawler, and Moving Images
  9.7 2000: Building Thematic Web Collections
  9.8 2001: Public Access with the Wayback Machine: The 9/11 Archive
  9.9 2002: The Library of Alexandria, The Bookmobile, and Copyrights
  9.10 2003: Extending Our Reach via National Libraries and Educational Institutions
  9.11 2004: And the European Archive and the Petabox
  9.12 The Future
  References

10 Small Scale Academic Web Archiving: DACHS
  Hanno E. Lecher
  10.1 Why Small Scale Academic Archiving?
  10.2 Digital Archive for Chinese Studies
  10.3 Lessons Learned: Summing Up
  10.4 Useful Resources

List of Acronyms
Index