CAT (CURATOR ARCHIVING TOOL): IMPROVING ACCESS TO WEB ARCHIVES


Ciro Llueca, Daniel Cócera
Biblioteca de Catalunya (BC), Barcelona, Spain
padicat@bnc.cat

Natalia Torres, Gerard Suades, Ricard de la Vega
Centre de Supercomputació de Catalunya (CESCA), Barcelona, Spain
padicat@cesca.cat

ABSTRACT

PADICAT is the web archive created in 2005 in Catalonia (Spain) by the Biblioteca de Catalunya (BC), the National Library of Catalonia, with the aim of collecting, processing and providing permanent access to the digital heritage of Catalonia. Its harvesting strategy is based on the hybrid model: massive harvesting of the .cat top-level domain, selective compilation of the web site output of Catalan organizations, and focused harvesting of public events. The system provides open access to the whole collection on the Internet. We consider it necessary to complement the current search and visualization software with a new open source tool, CAT (Curator Archiving Tool), composed of three modules that aim to manage the human cataloguing processes effectively, to publish the digital resources in directories and special collections, and to offer value-added statistical information to end users. Within the framework of the International Internet Preservation Consortium meeting (Vienna 2010), the progress made in developing this new tool, and the philosophy that has motivated its design, are presented to the international community.

1. INTRODUCTION

PADICAT (in English, Digital Heritage of Catalonia) is the web archive created in 2005 in Catalonia (Spain) by the Biblioteca de Catalunya (BC), the National Library of Catalonia, and the Centre de Supercomputació de Catalunya (CESCA), with the aim of collecting, processing and providing permanent access to the digital heritage of Catalonia, understood as the entire cultural, scientific and general output of Catalonia in digital format [1] and published on the Internet. The goal of PADICAT [2] is to archive the Catalan Internet.

The BC is a member of the IIPC (International Internet Preservation Consortium) and, as in other ongoing projects, the web archive based in Barcelona relies on several software tools that allow the crawling, storage, preservation and permanent access to a series of versions of the web pages published by the Catalan community [3], in Spain. Following an initial analysis of the Internet Archive, Kulturarw3 (National Library of Sweden), Pandora (National Library of Australia) and Netarkivet.dk (The Royal Library, State & University Library, Denmark) models, and in line with what we consider the common trend among national libraries and archives, PADICAT's repository follows the hybrid model, consisting of:

- Mass compilation of open-access digital resources published on the Internet, through exhaustive harvesting of the .cat top-level domain.
- Systematic archiving of the web site output of Catalan organizations, through their identification and the signing of cooperation agreements with entities and companies representative of Catalan society.
- Fostering lines of research through the themed integration of digital resources pertaining to specific events in Catalan public life, such as electoral campaigns on the Internet, Catalan music on-line or the museums of Catalonia.

In its daily operation PADICAT uses Heritrix, NutchWAX, WERA and the Wayback Machine, as well as the Web Curator Tool (WCT) software, developed by the National Library of New Zealand in collaboration with the British Library.

[1] Webb, C. Guidelines for the Preservation of Digital Heritage. United Nations Educational, Scientific and Cultural Organization, Paris, 2003. http://unesdoc.unesco.org/images/0013/001300/130071e.pdf
[2] See What PADICAT is at http://www.padicat.cat/en/quees.php
[3] Gomes, D.; Silva, M. J. Characterizing a National Community Web. ACM Transactions on Internet Technology, vol. 5, no. 3 (Aug 2005). http://xldb.fc.ul.pt/daniel/gomescharacterizing.pdf

As of September 2010, PADICAT offers through its web site (http://www.padicat.cat/en/) open access to 53,249 crawls of 30,481 web sites, and maintains cooperation agreements with 442 public administrations, companies, universities, professional associations and cultural centers of Catalonia. The open access offered to the whole collection, through searching or browsing, is one of the strong points of the Catalan web archive.

Together with the goal of preserving the information published on the Internet, visualization of the crawled collection over the Internet has become a priority. Aware of the legal restrictions that affect the majority of existing web archives, PADICAT has followed from the beginning the premise of open access to its collection, a premise defended by the Internet Archive. At the same time it has been manually weaving a network of entities and companies which, thanks to the signing of cooperation agreements, allows the crawled versions to be disseminated through the web archive's site.

The strategic goal of guaranteeing optimum retrieval and visualization of the crawled information has been pursued by adding value to the data of the crawled resources, associating new descriptive information with the digital resources. In short, by cataloguing the web resources that PADICAT has been harvesting and those it will harvest soon. At present, 17,700 sites have been catalogued. The information provided by manual cataloguing should bring an essential improvement in the identification of the resources and in their retrieval: a standardized title, alphabetical positioning in the thematic directory of the repository, assignment to and indexing within specific thematic categories, the interrelationship between resources inside the PADICAT collection, and so on.
The next challenge will be to modify the scoring of the current search systems (WERA and Wayback) so that it takes into account the metadata coming from cataloguing. Ultimately, these new search capabilities should give better visibility to the resources that make up the collection, allowing users to query the system without needing to know the URL of the desired resource.

The cataloguing of web resources has stimulated the design and production of a set of tools that we call CAT (Curator Archiving Tool). The aim of this report, distributed at the International Internet Preservation Consortium meeting (held within the 7th International Conference on Preservation of Digital Objects, iPRES 2010), is to show the international community the improvements proposed by PADICAT staff to the processes of retrieval and visualization of the information contained in web archives, and to make the advances in these lines of research and implementation publicly available to equivalent projects.

2. CAT (CURATOR ARCHIVING TOOL)

The harvesting process is well covered by quite mature software, such as Heritrix. However, indexing software and the corresponding modules for visualization have not evolved at the same pace. As in other catalogues, the retrieval of information contained in web archives is carried out through searching or browsing, the latter understood as navigation using a thematic directory. Only the searching part has specific software that is well established in the community, such as WERA and Wayback, which search by URL or keyword query. Navigation through thematic directories, as used in web archives like Pandora or UK WAC, requires the traditional process of cataloguing, which is also being carried out in PADICAT with the final goal of integrating its collections into the BC's catalogue.
Therefore, some steps in the management of web site resources, such as harvesting and searching, are served by mature software that is widely used in similar projects, but there is a gap in open source software for the cataloguing and publication of these resources. Faced with this challenge, and to guarantee better retrieval and visualization for non-traditional web archive users, PADICAT staff have developed several implementations and carried out various tests since 2005. These have now made it possible to launch a tool intended to improve the processes of searching and information retrieval, as well as the presentation of this information to users consulting a web archive.

These are the modules that make up CAT (Curator Archiving Tool), conceived with modularity, scalability and internationalization in mind, with a view to their publication as open source software. The software is made up of three modules: the basic one, covering the cataloguing process (MOCA), complemented by the functionality of two additional modules, publication (MOPU) and statistics (MOST). Together, the three make up the CAT (Curator Archiving Tool) software, the subject of this paper.

Figure 1. Software used at PADICAT

A description of the modules that make up CAT (Curator Archiving Tool) follows.

2.1. MOCA (Cataloguing Module)

The cataloguing of resources started with existing tools, essentially the pioneering Web Curator Tool (WCT). With experience, shortcomings were detected in the current software, and some particular needs of the project were not covered either. To address this, the development of a new tool focused solely on the cataloguing process was considered.

One of MOCA's main requirements is that it should record a changelog of the metadata of the catalogued resources, because this information might be relevant: a change in the URL associated with a preserved resource is a clear case to take into account.

The information collected to catalogue the resources is based on the Dublin Core metadata model, which can easily be extended without altering the data model. In PADICAT, crawled resources are processed following an indexing system [4] of the Biblioteca de Catalunya, which is based on a thematic classification specific to the PADICAT web portal, and also according to the origin of the crawl (the .cat domain, web sites under a cooperation agreement, proposed resources or monographic collections).

An authentication system allows only users identified by a user name and password to log in to the application; once validated, a user can access only the actions permitted by the assigned roles. The use of roles simplifies the user pages and allows precise control of the actions that each user can carry out. Three roles have been defined, based on the professional profiles of the staff involved in the cataloguing process: manager (maintenance tasks of the tool), cataloguer (inserts and updates the catalogued resources) and observer (control of the catalogued resources).

In MOCA's workflow it is mandatory to check whether a resource already exists in the catalogue, in order to avoid duplicates. Once this first step is passed, the minimum required metadata has to be entered and saved.

Figure 2. MOCA workflow

Some screens of the main interfaces are shown below: the insert interface allows entering the metadata associated with a resource, the search interface allows combined searches using any metadata field, and the batch update interface allows updating multiple resources at the same time.

Figure 3. MOCA insert resources screen shot v1.0

[4] LEMAC, Llista d'encapçalaments de matèria en català, an adaptation of the LCSH (Library of Congress Subject Headings) with contributions from Laval RVM (Répertoire de vedettes-matière) and RAMEAU (Répertoire d'autorité-matière encyclopédique et alphabétique unifié), maintained by the BC.
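The MOCA behaviour described in section 2.1 (role-restricted actions, a duplicate check before insertion, a required minimum of metadata, a changelog and batch updates) can be illustrated with a minimal sketch. It is not MOCA's actual code: the class and function names, the action names in the permission table and the choice of `title` and `identifier` as the required Dublin Core elements are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumed mapping from the three roles named in the text to action names.
PERMISSIONS = {
    "manager": {"maintain", "view"},
    "cataloguer": {"insert", "update", "view"},
    "observer": {"view"},
}

# Assumed minimum set of Dublin Core elements; the paper does not list them.
REQUIRED_FIELDS = ("title", "identifier")


@dataclass
class Resource:
    metadata: dict                                   # Dublin Core element -> value
    changelog: list = field(default_factory=list)    # one entry per change


class Catalogue:
    def __init__(self):
        # Keyed by the URL under which the resource was first catalogued
        # (a simplification made for this sketch).
        self._resources = {}

    @staticmethod
    def _authorize(role, action):
        if action not in PERMISSIONS.get(role, set()):
            raise PermissionError(f"role {role!r} cannot perform {action!r}")

    def insert(self, role, url, metadata):
        self._authorize(role, "insert")
        # First workflow step: refuse resources that are already catalogued.
        if url in self._resources:
            raise ValueError(f"{url} is already catalogued")
        # Second step: the minimum metadata must be present before saving.
        missing = [f for f in REQUIRED_FIELDS if not metadata.get(f)]
        if missing:
            raise ValueError(f"missing required metadata: {missing}")
        self._resources[url] = Resource(dict(metadata))
        return self._resources[url]

    def update(self, role, urls, element, value):
        """Set one element on one or many resources, logging each change."""
        self._authorize(role, "update")
        for url in urls:
            resource = self._resources[url]
            resource.changelog.append({
                "when": datetime.now(timezone.utc).isoformat(),
                "by": role,
                "element": element,
                "old": resource.metadata.get(element),
                "new": value,
            })
            resource.metadata[element] = value
```

Passing a list of URLs to `update` covers both the single edit and the batch update case, and every change, such as a new URL in the `identifier` element, leaves an entry in the resource's changelog.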

Figure 4. MOCA search resources screen shot v1.0

Figure 5. MOCA batch update screen shot v1.0

Once the tool had been developed, several scripts were created to migrate the existing information in WCT to MOCA's new data model. The release of MOCA version 1.0 is scheduled for the end of March 2011.

2.2. MOPU (Publication Module)

MOPU will automatically generate the thematic directory, which will contain all resources that were previously catalogued using MOCA and selected for publication. Optionally, each resource will have an attached item information page with descriptive information drawn from the cataloguing metadata considered relevant for the visualization of the crawled resource. The metadata included in the item information page are selected individually for each resource.

MOPU requires MOCA to be properly installed. As in MOCA, a changelog will be generated automatically in MOPU, making it possible to track the publication process and the changes in the item information page.

Figure 6. MOPU workflow

MOPU needs the definition of a new role, known as validator, who will be in charge of managing the whole publication process. The validator will check the metadata inserted by cataloguers and will approve the publication of the resource in the thematic directory. If necessary, the validator will also select which metadata are shown on the item information page. If the validator detects any deficiency in the metadata, the resource will be marked for checking by cataloguers, and the validator will record in a note field attached to the resource the reason why the metadata need to be changed.
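The publication step described for MOPU can be sketched in a few lines: only resources approved by the validator enter the thematic directory, and each item information page shows only the metadata selected for that resource. This is an illustrative sketch rather than MOPU's implementation; the record layout (`status`, `categories`, `shown_fields`, `metadata`) and the function name `build_directory` are assumptions.

```python
from collections import defaultdict


def build_directory(records):
    """Group approved records by thematic category, sorted by title.

    Each record is a dict with the hypothetical keys:
      status        "approved" or "needs_review" (set by the validator)
      categories    thematic categories the resource belongs to
      shown_fields  metadata elements selected for the item information page
      metadata      Dublin Core element -> value (from cataloguing)
    """
    directory = defaultdict(list)
    for record in records:
        # Resources the validator has not approved stay unpublished.
        if record.get("status") != "approved":
            continue
        metadata = record["metadata"]
        # The item page carries only the elements selected for this resource.
        item_page = {
            name: metadata[name]
            for name in record.get("shown_fields", [])
            if name in metadata
        }
        entry = {"title": metadata["title"], "item_page": item_page}
        for category in record["categories"]:
            directory[category].append(entry)
    # Alphabetical order, both for the categories and within each category.
    return {
        category: sorted(directory[category], key=lambda e: e["title"].casefold())
        for category in sorted(directory)
    }
```

A resource marked `needs_review`, together with the validator's note, would simply be skipped until a cataloguer corrects it and the validator approves it.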
Some screens of the main interfaces of the pilot prototype are shown below: the interface for managing the thematic directory by category, and the interface for selecting the metadata to include in the item information page.

Figure 7. MOPU prototype pilot screen shot

Figure 8. MOPU prototype pilot screen shot

2.3. MOST (Statistics Module)

This module will generate statistics on the information held in the repository from the reports produced by the crawling software, such as the number and type of files belonging to each preserved crawl, the size and frequency of each crawl, etc. This is a value-added service because PADICAT offers searching and browsing through all crawled resources in open access. Linking this module with MOCA and MOPU will make it possible to include statistics on the resource item information page.

2.4. Roadmap

As soon as a sufficiently mature and stable version of the CAT (Curator Archiving Tool) modules is reached, with feedback on their use, it will be released to the community as open source software. The release of MOCA is expected before the end of March 2011.

3. CONCLUSIONS

Harvesting web sites published on the Internet is adequately covered by quite mature software, like Heritrix. However, indexing software and the corresponding modules for visualization have not evolved at the same pace. For web archives like PADICAT, which offers open access to its collection on the Internet, it is essential to improve the processes of search and visualization.

The CAT (Curator Archiving Tool) software improves the search and visualization of the resources preserved in web archives, thanks to its contributions in:

- Human input in describing and indexing, through the Cataloguing module.
- Automated generation of directories and special collections, through the Publication module.
- A wider offer of statistical data, through the Statistics module.

For web archive managers, the interaction of the three modules will save time thanks to the automation of processes that are manual at the moment:
- Automated integration of preserved resources into the alphabetical lists, thematic directories or special collections that provide access to the collection through navigation.
- Batch update of catalogued resources sharing the same characteristics.
- Improvement of workflows between the different roles involved in the cataloguing and publication of resources on the web archive's site.
- Alignment with the strategy of the Biblioteca de Catalunya (BC), whose goal is to integrate the preserved resources into the Library's catalogues and ordinary search systems.

For web archive users, the creation and launch of the CAT (Curator Archiving Tool) software will improve the search capabilities and the visualization of preserved resources:

- Search by thematic navigation, more effective and exhaustive, complementing the current systems of searching by keyword or URL.
- More appropriate keyword search, once the scoring of the current search system has been modified with the metadata coming from cataloguing.
- Integrated search, once the PADICAT collection has been integrated into the catalogues and ordinary search systems of the Biblioteca de Catalunya.
- Access to value-added information for each preserved resource, coming from human cataloguing or from system-generated statistics.

For the international community, the publication of the three modules of CAT (Curator Archiving Tool) as open source software makes it possible to:

- Contribute to the goal of acquiring, preserving and making accessible Internet information for future generations around the world, promoting global exchange and international relations.
- Foster human intervention in the cataloguing and publication processes of web archives, to improve access to and visualization of these systems.
- Achieve public and political appreciation of the usefulness of Internet digital preservation projects, such as the web archives present at the IIPC annual meeting.