Harvesting Metadata Using OAI-PMH

Roy Tennant
California Digital Library

Outline

- The Open Archives Initiative
- OAI-PMH
- The Harvesting Process
- Harvesting Problems
- Steps to a Fruitful Harvest
- A Harvesting Service Model
- The OAI Future

Open Archives Initiative

- Aimed at making the large and growing number of repositories of freely available digital content interoperable
- The Protocol for Metadata Harvesting (OAI-PMH) specifies how repositories can expose their metadata for others to harvest
- Over 800 repositories worldwide support the protocol
- OAIster.org has indexed nearly 6 million items from over 500 of those repositories

www.oaforum.org/tutorial/

OAI-PMH

OAI Architecture

- Data providers (DP): those with the stuff
- Service providers (SP): those who harvest metadata and provide aggregation and search services
- Software for both DPs and SPs is readily available

OAI-PMH verbs:

- Identify
- ListIdentifiers
- ListMetadataFormats
- ListSets
- ListRecords
- GetRecord

Source: Open Archives Forum Tutorial
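Each of the six verbs above is sent to a repository as an ordinary HTTP request with a `verb` parameter. The following is a minimal sketch of how such request URLs could be assembled; the endpoint `https://repository.example.org/oai` and the helper name `build_request` are illustrative assumptions, not part of the talk.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs listed above.
VERBS = ("Identify", "ListIdentifiers", "ListMetadataFormats",
         "ListSets", "ListRecords", "GetRecord")


def build_request(base_url, verb, **arguments):
    """Return the GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb}
    # Leave out any argument the caller did not set.
    query.update({k: v for k, v in arguments.items() if v is not None})
    return f"{base_url}?{urlencode(query)}"


BASE_URL = "https://repository.example.org/oai"  # hypothetical endpoint
print(build_request(BASE_URL, "Identify"))
print(build_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc"))
```

Validating the verb name up front is a design choice: a mistyped verb fails locally rather than as an error response from the repository.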

Identify

- Provides basic information about a repository

ListMetadataFormats

- Lists available metadata formats

ListIdentifiers

- Lists all identifiers (or only those of the optionally specified set)
- Must include the metadataPrefix argument

ListSets

- Lists available sets
- Example shown: Library of Congress ListSets response
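To make the ListSets response concrete, here is a small sketch that pulls the set identifiers and names out of a response document. The sample XML is invented for illustration (it is not the Library of Congress response shown in the talk), and `parse_sets` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample response, trimmed to the parts used here.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>maps</setSpec><setName>Historic Maps</setName></set>
    <set><setSpec>photos</setSpec><setName>Photographs</setName></set>
  </ListSets>
</OAI-PMH>"""


def parse_sets(xml_text):
    """Return (setSpec, setName) pairs from a ListSets response."""
    root = ET.fromstring(xml_text)
    return [
        (s.findtext("oai:setSpec", namespaces=OAI_NS),
         s.findtext("oai:setName", namespaces=OAI_NS))
        for s in root.iterfind(".//oai:set", OAI_NS)
    ]


for spec, name in parse_sets(SAMPLE):
    print(spec, "-", name)
```

The `setSpec` value is what a harvester would later pass as the `set` argument to ListIdentifiers or ListRecords.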

ListRecords

- Lists all records (or only those of the optionally specified set)
- Must include the metadataPrefix argument

GetRecord

- Retrieves a specific record
- Must include the metadataPrefix and identifier arguments

The Harvesting Process

A Harvesting Service Model

- Identifying Sources
- Selecting Sets
- Harvesting
- Metadata Processing
- Indexing
- Interface

Identifying Sources

- errol.oclc.org
- gita.grainger.uiuc.edu/registry/
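The ListRecords verb described above usually returns a large harvest in several pages. The slides do not cover this, but in OAI-PMH 2.0 a partial response carries a `resumptionToken` that the harvester sends back to get the next page. Below is a rough sketch of that loop under those assumptions; `harvest` and `fetch` are hypothetical names, and protocol `<error>` responses are not handled.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def fetch(url):
    """Download one response body; can be swapped for a stub when offline."""
    with urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix="oai_dc", set_spec=None, fetch=fetch):
    """Yield every <record> element, following resumption tokens."""
    args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        args["set"] = set_spec
    while True:
        root = ET.fromstring(fetch(f"{base_url}?{urlencode(args)}"))
        yield from root.iter(f"{OAI}record")
        token = root.find(f".//{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one: the list is complete
        # A resumed request carries only the verb and the token.
        args = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

A call such as `list(harvest("https://repository.example.org/oai", set_spec="maps"))` would collect every record of one set; the URL and set name there are placeholders.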

Selecting Sets

- Review the response to the ListSets verb
- It may be instructive to search the collection in its native interface, if possible
- Look for descriptive pages on the site being harvested

Harvesting

- Many harvesting applications are available; I will focus on:
  - Public Knowledge Project (PKP) Harvester
  - Virginia Tech Perl Harvester
- Library software vendors increasingly offer harvesting products (e.g., ExLibris MetaIndex)

Virginia Tech Perl Harvester

    +-----------------------------------------+
     Harvester Sample Configurator
    +-----------------------------------------+
     Version 1.1 :: July 2002
     Hussein Suleman <hussein@vt.edu>
     Digital Library Research Laboratory
     www.dlib.vt.edu :: Virginia Tech
    ------------------------------------------+

    Defaults/previous values are in brackets - press <enter> to accept those
    enter "&delete" to erase a default value
    enter "&continue" to skip further questions and use all defaults
    press <ctrl>-c to escape at any time (new values will be lost)

    Press <enter> to continue

    [ARCHIVES]
    Add all the archives that should be harvested

    Current list of archives:
    No archives currently defined!

    Select from: [A]dd [D]one
    Enter your choice [D] : a{return}

    [ARCHIVE IDENTIFIER]
    You need a unique name by which to refer to the archive you
    will harvest metadata from
    Examples: nsdl-380602, VTETD
    Archive identifier [] : nsdl-380602{return}

Let's Harvest!

Indexing

- Pick your favorite database/indexing software:
  - MySQL
  - SWISH-E
  - Whatever is lying around
- May need to specifically set up a method to search across the entire record
- May need different fields for indexing than for display
- Will need to deal with element collision

Interface

- Software interface (API) for other applications:
  - SRU/SRW?
  - MXG?
  - Arbitrary Web Services schema?
- User interface:
  - What functions do you want your users to be able to perform?
  - What kinds of displays do you want?

Harvesting Problems

- Sets
- Metadata Formats
- Metadata Artifacts
- Granularity
- Metadata Variances

Sets

- Records are harvested in clumps, called sets, created by DPs
- No guidelines exist for defining sets
- Examples:
  - Collection
  - Organizational structure
  - Format (but is a page image an image? See example)

Metadata Formats

- The only required format is simple Dublin Core, although any format can be made available in addition
- Few DPs surface richer metadata
- Simple DC is simply too simple!
- Example (artifact vs. surrogate dates)
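Two of the points on the Indexing slide above, searching across the entire record and dealing with element collision, can be sketched independently of any particular indexing software. The sketch below assumes that "collision" means the same element occurring more than once in a record, which is only one possible reading; the field name `all_text` and the helper `to_index_document` are hypothetical.

```python
from collections import defaultdict


def to_index_document(pairs):
    """Turn (element, value) pairs from one record into an index document."""
    fields = defaultdict(list)
    for element, value in pairs:
        value = value.strip()
        if value:
            # Repeated elements collide on the same key; keep every value
            # instead of letting the last one overwrite the others.
            fields[element].append(value)
    doc = dict(fields)
    # One catch-all field so a query can search the entire record.
    doc["all_text"] = " ".join(v for values in fields.values() for v in values)
    return doc


# Illustrative record; the title is invented.
record = [("title", "Street scene"), ("date", "1863."), ("date", "[2001 or 2002.]")]
print(to_index_document(record))
```

Keeping the per-element lists separate from the catch-all field also leaves room for using different fields for indexing than for display.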

Metadata Artifacts

- Unintended, unwanted aberrations
- Sample causes:
  - Idiosyncratic local practices
  - Anachronisms
  - HTML code
- Examples:
  - Circa = string of dates for searching purposes
  - [electronic resource]

Granularity

- Record granularity: what is an object? A book, or each individual page?
  - Examples: CDL, Univ. of Michigan
- Metadata granularity: multiple values in one field
  - Example: Univ. of Washington

Metadata Variances

- Subject terminology differences
- Disparities in recording the same metadata
  - Example: date variances
- Mapping oddities or mistakes
  - Examples: 1) format into description, 2) description into subject

Steps to a Fruitful Harvest

- Needs Assessment (it's the user, stupid)
- DP Identification and Communication
- Metadata Capture
- Metadata Analysis
- Metadata Subsetting
- Metadata Normalization
- Metadata Enrichment
- Indexing & Display
- Interface (it's still the user, stupid)

Needs Assessment

- What are you trying to accomplish?
- What will your users want to be able to do?
- What metadata will you need, and what procedures will you need to set up to enable these activities?

DP Identification & Communication

- Which repositories have what you want?
- Is what they have (e.g., sets, metadata) usable as is, or?
- Identification:
  - Use the UIUC directory of DPs to identify potential sources
- Communication:
  - Not required to tell them you are harvesting, but it may help establish a good relationship
  - May want to request that they surface a richer metadata format and/or provide a different set
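The metadata granularity problem noted above (multiple values packed into one field) is often handled by splitting the field on its delimiter before indexing. The following is a minimal sketch; the semicolon and vertical-bar separators and the sample subject string are assumptions, since the slides do not show the actual University of Washington data.

```python
import re


def split_values(value, separators=";|"):
    """Split a multi-valued field into a list of trimmed, non-empty values."""
    pattern = "[" + re.escape(separators) + "]"
    return [part.strip() for part in re.split(pattern, value) if part.strip()]


# Invented example of several subject terms stored in a single field.
print(split_values("Indians of North America; Baskets;  Washington (State)"))
```

The right separator has to be confirmed per data provider during metadata analysis, because a character that delimits values in one repository can be ordinary punctuation in another.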

Metadata Capture

- Sample questions to answer:
  - Individual sets, or all?
  - Richer metadata formats available?
  - How frequently to reharvest?
  - Start from scratch each time or update?
- Many software options

Metadata Analysis

- Finding out what you have (and don't have)
  - Encoding practices
  - Gap analysis (e.g., missing fields)
  - Mistakes (e.g., mapping errors)
- Software can help
  - Commercial software like Spotfire
  - In-house or open source software tools

Five elements are used 71% of the time
Source: 2002 Master's thesis, Jewel Hope Ward, UNC Chapel Hill
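The analysis step above (finding out what you have and what is missing) can start with simple counts over the harvested records. This sketch assumes each record has already been reduced to a list of (element, value) pairs; the function names, the choice of required elements, and the sample records are all illustrative.

```python
from collections import Counter


def element_usage(records):
    """Return (element, count, share of all uses), most used first."""
    counts = Counter(
        element
        for record in records
        for element, value in record
        if value.strip()
    )
    total = sum(counts.values())
    return [(element, n, n / total) for element, n in counts.most_common()]


def missing_fields(records, required=("title", "date", "identifier")):
    """Gap analysis: how many records lack each required element."""
    gaps = {element: 0 for element in required}
    for record in records:
        present = {element for element, value in record if value.strip()}
        for element in required:
            if element not in present:
                gaps[element] += 1
    return gaps


records = [
    [("title", "A"), ("date", "1900"), ("identifier", "x1")],
    [("title", "B"), ("subject", "s")],
    [("title", "C"), ("date", ""), ("identifier", "x3")],
]
print(element_usage(records))
print(missing_fields(records))
```

Empty values are treated as missing, so a record that carries an element with no content still shows up in the gap count.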

Metadata Subsetting

- DP sets are unlikely to serve all SP uses well
- SPs will need the ability to subset harvested metadata

Metadata Normalization

- Normalizing: to reduce to a standard or normal state
- Prototype date normalization service screen

Metadata Enrichment

- Adding fields and/or qualifiers may be useful or required, for example:
  - Metadata provider information
  - Geographic coverage
  - Subject terms mapped to a different thesaurus
  - Authority control record
- The enrichment process may use the same tool as the subsetting tool (i.e., find a cluster of records and perform an action)

Indexing & Display

- Selected fields may need to be mapped to specific indexing and display elements
- Particularly required if harvesting different metadata formats
- But also needs to be done with multiple, conflicting fields:

    <date>1863.</date>
    <date>[2001 or 2002.]</date>
    <identifier>shs 1,679</identifier>
    <identifier>http://content.lib.washington.edu/cgi-bin/htmlview.exe?cisoroot=/loc&cis
    <identifier>http://content.lib.washington.edu/loc/image/1679.jpg</identifier>

A Harvesting Service Model
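The two conflicting date values quoted above give a feel for what date normalization has to cope with. The sketch below is not the prototype service mentioned in the talk; it is a simplified stand-in that extracts four-digit years and reduces each value to an earliest/latest pair. The accepted year range (1500 to 2099) and the name `normalize_date` are assumptions.

```python
import re

# Four-digit years from 1500 to 2099 that are not part of a longer number.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")


def normalize_date(raw):
    """Return (earliest_year, latest_year), or None if no year is found."""
    years = [int(y) for y in YEAR.findall(raw)]
    if not years:
        return None
    return (min(years), max(years))


for raw in ("1863.", "[2001 or 2002.]", "n.d."):
    print(repr(raw), "->", normalize_date(raw))
```

Reducing every value to a year range loses qualifiers such as brackets, so the original string would normally be kept alongside the normalized form for display.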

The OAI Future

- oai-best.comm.nsdl.org
- Further protocol development
- Services layered on top of OAI-PMH
- Shared software tools
- Best practices for both DPs and SPs