Open Archives Initiative protocol development and implementation at arxiv

Similar documents
arxiv, the OAI, and peer review

Exposing and Harvesting Metadata Using the OAI Metadata Harvesting Protocol: A Tutorial

OAI-PMH. DRTC Indian Statistical Institute Bangalore

OAI-PMH implementation and tools guidelines

Problem: Solution: No Library contains all the documents in the world. Networking the Libraries

Building Interoperable and Accessible ETD Collections: A Practical Guide to Creating Open Archives

RVOT: A Tool For Making Collections OAI-PMH Compliant

Version 2 of the OAI-PMH & some other stuff

Using metadata for interoperability. CS 431 February 28, 2007 Carl Lagoze Cornell University

The Open Archives Initiative and the Sheet Music Consortium

Building Interoperable Digital Libraries: A Practical Guide to creating Open Archives

IVOA Registry Interfaces Version 0.1

The Open Archives Initiative Protocol for Metadata Harvesting: An Introduction

Introduction to the OAI Protocol for Metadata Harvesting Version 2.0. Hussein Suleman Virginia Tech DLRL 17 June 2002

Metadata Harvesting Framework

Creating a National Federation of Archives using OAI-PMH


Network Information System. NESCent Dryad Subcontract (Year 1) Metacat OAI-PMH Project Plan 25 February Mark Servilla

Harvesting Metadata Using OAI-PMH

OAI-PMH repositories: Quality issues regarding metadata and protocol compliance

Interoperability and Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

OAI Static Repositories (work area F)

adore: a modular, standards-based Digital Object Repository

Better interoperability through the Open Archives Initiative

OAI AND AMF FOR ACADEMIC SELF-DOCUMENTATION

BIBLID (2004) 93:1 pp (2004.6) 209. NBINet NBINet 92

The Open Archives Initiative Protocol for Metadata Harvesting

CodeSharing: a simple API for disseminating our TEI encoding. Martin Holmes

Integrating Access to Digital Content

The multi-faceted use of the OAI-PMH in the LANL Repository

Tutorial. Open Archive Initiative

A Novel Architecture of Agent based Crawling for OAI Resources

Taking D2D Services to the Users with OpenURL, RSS, and OAI-PMH. Chuck Koscher Technology Director, CrossRef

Harvesting Statistical Metadata from an Online Repository for Data Analysis and Visualization

Indonesian Citation Based Harvester System

IMu OAI-PMH Web Service

Joining the BRICKS Network - A Piece of Cake

Open Archives Initiative Object Reuse & Exchange. Resource Map Discovery

ORCA-Registry v2.4.1 Documentation

Publishing Based on Data Provider

Go Sugimoto, Kerstin Arnold, Wim van Dongen, Yoann Moranville Reviewer: Lucile Grand

mod_oai: An Apache Module for Metadata Harvesting

Establishing New Levels of Interoperability for Web-Based Scholarship

You may print, preview, or create a file of the report. File options are: PDF, XML, HTML, RTF, Excel, or CSV.

The Observation of Bahasa Indonesia Official Computer Terms Implementation in Scientific Publication

Comparing Open Source Digital Library Software

How to contribute information to AGRIS

OAI-ORE. A non-technical introduction to: (

Open Archives Initiative Object Reuse & Exchange. Resource Map Discovery

Making scholarly statistics count in UK repositories. RSP Statistics Webinar Paul Needham, Cranfield University 26 February 2013

Presented by Sandro Zic

Institutional Repository using DSpace. Yatrik Patel Scientist D (CS)

Outline of the course

Metadata and Encoding Standards for Digital Initiatives: An Introduction

PDF Specification for IEEE Xplore (Part A-Core Requirements)

An introduction to OAI-PMH

The OAI2LOD Server: Exposing OAI-PMH Metadata as Linked Data

Building a Digital Library Software

CARARE Training Workshops

Expected and Unexpected Synergies

Chuck Cartledge, PhD. 25 February 2018

Archon - A Digital Library that Federates Physics Collections

Open Research Online The Open University s repository of research publications and other research outputs

The Sunshine State Digital Network

Implementing OAI Data and Service Providers

Flexible Design for Simple Digital Library Tools and Services

Orbis Cascade Alliance Content Creation & Dissemination Program Digital Collections Service. Enabling OAI & Mapping Fields in Digital Commons

Pushing the Limits. ADSM Symposium Sheelagh Treweek September 1999 Oxford University Computing Services 1

idigbio Data Ingestion Requirements and Guidelines

Repository Interoperability

ISO INTERNATIONAL STANDARD

Interoperability, Metadata, and Complex Objects

Corso di Biblioteche Digitali

Repository models and policies for preservation

Building a Digital Repository on a Shoestring Budget

The MEG Metadata Schemas Registry Schemas and Ontologies: building a Semantic Infrastructure for GRIDs and digital libraries Edinburgh, 16 May 2003

For many years, the creation and dissemination

EXTENDING OAI-PMH PROTOCOL WITH DYNAMIC SETS DEFINITIONS USING CQL LANGUAGE

Open Archives Forum - Technical Validation -

A methodology for Sharing Archival Descriptive Metadata in a Distributed Environment

Developing ArXivSI to Help Scientists to Explore the Research Papers in ArXiv

Initial Experiences Re-exporting Duplicate and Similarity Computations with an OAI-PMH Aggregator

Digital Objects, Data Models, and Surrogates. Carl Lagoze Computing and Information Science Cornell University

Invenio: A Modern Digital Library for Grey Literature

Types of Language Resource. The OLAC Metadata Set and Controlled Vocabularies. The Language Resources Community. Now: Underdevelopment

WP.25 ENGLISH ONLY. Supporting Paper. Submitted by Statistics Finland 1

An RDF NetAPI. Andy Seaborne. Hewlett-Packard Laboratories, Bristol

2nd Technical Validation Questionnaire - interim results -

Applying SOAP to OAI-PMH

Persistent identifiers, long-term access and the DiVA preservation strategy

Undergraduate Admission File

DELIVERABLE. D2.2: Modified MINT prototype. LoCloud. Local content in a Europeana cloud. Project Acronym: Grant Agreement number:

User guide. Created by Ilse A. Rasmussen & Allan Leck Jensen. 27 August You ll find Organic Eprints here:

INTRO INTO WORKING WITH MINT

WHY WE NEED AN XML STANDARD FOR REPRESENTING BUSINESS RULES. Introduction. Production rules. Christian de Sainte Marie ILOG

Registry Interchange Format: Collections and Services (RIF-CS) explained

Smart Routing. Requests

oatd.org Discovery for Open Access Theses and Dissertations An ASERL Webinar, October 15, 2013 These slides:

Harvesting of Additional Metadata Schema into DSpace through OAI-PMH: Issues and Challenges

RDF and Digital Libraries

Transcription:

Open Archives Initiative protocol development and implementation at arxiv Simeon Warner (Los Alamos National Laboratory, USA) (simeon@lanl.gov) OAI Open Day, Washington DC 23 January 2001 1

What is arxiv? http://arxiv.org/ aka Los Alamos e-print archive, formerly xxx. ( e-prints may be unpublished works, pre-prints or published works.) Unrefereed author self-archiving. No-fee retrieval by users worldwide.

Background Aug 1991 Physics e-print archive started: hep-th archive with email interface. 1992 ftp interface added. hep-ph and hep-lat added locally; alggeom, astro-ph and cond-mat added remotely. Dec 1993 Web interface added. Nov 1994 Data at some remote archives (using the same software) moved to main site, the remote sites become mirrors. Jun 1995 Automatic PostScript generation from TEX source. Apr 1996 PDF generation added. Jun 1996 Web upload facility added. from 1996 Worldwide mirror network grows. from 1999 arxiv involved in the OAI.

The present Covers physics, math, computer science, and non-linear systems. Serves over 70,000 users in over 100 countries. Estimated 13 million downloads in 2000. Over 30,000 new submissions in 2000, over 150,000 e-prints total (approximately linear growth in submission rate, 3500 extra each year). 99% of submissions entirely automated. Submission via web (68%), email (27%) and ftp (5%). Some journals now accept and arxiv identifiers instead of requiring direct submission (e.g. APS: Phys. Rev. D, Elsevier: Phys. Lett. B). Los Alamos site funded by DOE and NSF; mirror sites funded locally.

arxiv software The software running arxiv comprises of the order of 30,000 lines of Perl running under Linux with numerous other programs (TEX, ghostscript, tar, gnuzip,...). Software has evolved over past 9 years. Currently tidying and rewriting Perl code in modular fashion. Insufficient resources for complete off-line rewrite. Pressure to change underlying database and identifier structure.

Involvement of arxiv in the OAI The meeting held in Santa Fe in 1999, from which OAI has emerged, was organized by Paul Ginsparg (arxiv), Rick Luce and Herbert Van de Sompel. arxiv has continued to be actively involved in both management and technical development. The subset Dienst protocol resulting from the Santa Fe meeting was implemented at arxiv by 15 February 2000. Initial focus of the OAI was e-print archive interoperability. While the scope of OAI has expanded considerably, the e-print community has led the protocol development. The e-print community is, so far, the only community to have defined community specific formats for use in the OAI (see the description section of the Identify verb included in the protocol specification as an example).

OAI protocol v1.0 implementation arxiv is a data provider Concepts in protocol: identifiers, datestamps, sets, deleted records, and metadata formats. The verbs: Identify, ListSets, ListMetadataFormats, GetRecord, ListIdentifiers, ListRecords Flow control Cost

Identifiers Internal arxiv identifiers have the form: arch-ive/yymmnnn arch-ive.sc/yymmnnn OAI protocol restricts identifiers to follow the URI syntax. Appendix 2 describes the oai-identifier type which we use. In practice this means we prepend oai:arxiv: to our internal identifiers. hep-th/9901001 quant-ph/9912010 math.sg/0001001 cs.se/0101002 oai:arxiv:hep-th/9901001 oai:arxiv:quant-ph/9912010 oai:arxiv:math.sg/0001001 oai:arxiv:cs.se/0101002

Datestamps arxiv keeps logs of the date of: submission of different revisions, and dates of cross-listings. However, no logs have been kept of by-hand administrator modifications or the addition of journal-references. For OAI, we need a datestamp that reflects any change in the metadata. Use the file modification date extracted from the file modification time. Need to build indexes to support ListIdentifiers and ListRecords requests. Indexes must be updated daily to avoid missing updates.

Sets The OAI protocol characterizes sets as an optional construct for grouping items in a repository for the purpose of selective harvesting of records, and they may be arranged in zero or more hierarchies. Note: OAI data providers don t have to implement sets. Even when harvesting from data providers that implement sets, service providers don t have to use sets (i.e. they never specify a setspec).

Sets (cont d.) arxiv has a two- or three-level (depending on subject area) grouping hierarchy based on the subject of the e-print. The three levels are: group There are four groups: physics, math, cs, nlin archive The physics group has many archives (e.g. hep-th, astro-ph, cond-mat and even physics). Group and archive may be identified for the three other groups. subject-class Some archives have subject-classes and individual e-prints may belong to one or more subject-class. arxiv declares four sets which correspond to the four groups listed above.

Deleted records arxiv does not allow e-prints to be removed once submitted. We do permit authors to submit a withdrawal notice as a new revision. Our OAI interface exports metadata for the withdrawal notice in the same was as for any other e-print. There are also a very small number of e-prints that have been deleted. Currently 9 deleted e-prints in arxiv. Return the OAI status="deleted" attribute. Implemented using a lookup table.

Metadata formats We disseminate metadata for e-prints in the following formats: oai dc Dublin Core encoded in XML. oai rfc1807 RFC1807 encoded in XML. arxiv Test-bed for new internal XML metadata format. arxivold XML encoded version of current internal metadata format. Many mappings are obvious, e.g. Title Dublin Core description. Dublin Core title, and Abstract

Identify verb e.g. http://arxiv.org/oai1?verb=identify Trivial to implement, simply writes a number of configuration variables out in an XML record. arxiv returns two description containers: oai-identifier Declares our use of the OAI identifier scheme defined by oai-identifier.xsd, and gives a sample identifier oai:arxiv:quant-ph/9901001. eprints As we are part of the e-prints community, we include a description container that follows the eprints.xsd schema.

ListSets verb Again, very straightforward to implement. The code simply extracts the group names (long and short) from configuration variables and writes out as XML: <set> <setspec>nlin</setspec> <setname>nonlinear Sciences</setName> </set> <set> <setspec>math</setspec> <setname>mathematics</setname> </set>... <set> <setspec>cs</setspec> <setname>computer Science</setName> </set> Small number of sets no need to implement the partial response and acceptance of a resumptiontoken.

GetRecord verb The majority of the effort involved in implementing the GetRecord verb is in performing the metadata format conversion and mapping from our internal format to the format requested. Four cases to consider: 1. Item does not exist no <record> container returned. 2. Item is deleted <record status="deleted"> container with <header> block returned. 3. Item exists but can not be disseminated in the requested metadata format <record> container with <header> but no <metadata> block returned. 4. Item exists and can be disseminated in the requested metadata format <record> container with <header> and <metadata> blocks returned. Chose not to implement <about>.

ListIdentifiers verb This verb is essentially a search by datestamp, the optional from and until parameters specifying the datestamp range, and the optional setspec parameter limiting the archives searched. Don t want to return all 150,000 identifiers at once. Implement partial response, supply resumptiontoken as necessary. We choose to build resumptiontoken from the from, until and set- Spec parameters for new request.

ListRecords verb Essentially a combination of the ListIdentifiers and GetRecord verbs: Implement search by datestamp and set using ListIdentifiers code. Implement metadata dissemination using GetRecord code.

Flow Control The main arxiv site, http://arxiv.org/, is heavily used and has various automated scripts to prevent badly written and non-conforming (i.e. not obeying /robots.txt) robots from loading the server to the point where there is denial-of-service to other users. arxiv is particularly vulnerable because of the fact that most papers are stored as TEX source and processed to produce PostScript or PDF on demand (with a large cache). Flow control is thus essential to avoid legitimate OAI service providers getting blocked. Implemented with HTTP 503 and resumptiontoken. Minimum delay between requests from any given site, else Retry-After. Successfully avoids compliant harvesters from getting blocked (e.g. arc).

Implementation cost Santa Fe convention, OAI Dienst Subset, December 1999 2-3 weeks effort which included time to understand Dienst, writing utility routines such as an XML writing library (now available in standard libraries), coding a search by date, and coding conversion of our metadata to R- FC1807. OAI v0.2, October 2000 2 days which included writing TEX UTF-8 conversion code for the special characters which are currently mostly TEX encoded in our metadata. Also included rewriting parsing code for simplified syntax. OAI v1.0, January 2001 Many small changes during protocol development, most of which involved simplifications in the code! The majority of the effort necessary to create a new implementation is likely to be in routines to implement metadata format conversion; a search to find records by datestamp; and perhaps flow control through partial responses and Retry-After returns.

What happens now? Widespread adoption of OAI protocol by e-print and other communities. Development of enhanced metadata set specific to e-print community. Generic OAI tools using Dublin Core metadata. E-print specific tools using community specific metadata set. Better environment for research. Increased visibility for work submitted to arxiv.

That s all folks...