Metadata Harvesting Framework
(Diagram: Data Providers (collection builders), OAI protocol (over HTTP), Service Provider (TEL, NSDL) holding the harvested records, Library User)
1. The Service Provider polls periodically for new records.
2. New records are downloaded and cached by the Service Provider.
3. The Service Provider provides searching, browsing, and other services over the data.
OAI workshop, December 11, 2006

Multiple representations of an object
Object: Honoré Daumier lithograph (Brandeis University)
- MARC record in XML
- Dublin Core record in XML
- Qualified Dublin Core record in XML
- MODS record in XML
HTTP and XML
- The OAI-PMH is an almost stateless request/response protocol.
- Requests and responses are sent over HTTP.
- Requests are made using HTTP GET or POST operations.
- Responses are returned as well-formed, valid XML documents.

Well-formed and Valid XML
Correct:
<car>
  <make>dodge</make>
  <model>spirit</model>
  <year>1994</year>
  <owner>
    <name>you</name>
    <plate>co</plate>
  </owner>
</car>
Incorrect (year is never closed, and car is closed before owner):
<car>
  <make>dodge</make>
  <model>spirit</model>
  <year>1994
  <owner>
    <plate>co</plate>
    <name>you</name>
  </car>
</owner>
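The difference between the two car documents above can be checked mechanically. The following is a minimal sketch using Python's standard xml.etree.ElementTree parser; the helper name is_well_formed is an assumption made for illustration. It tests well-formedness only: validity against a schema needs a validating parser, which the standard library does not provide.

```python
import xml.etree.ElementTree as ET

# The "correct" document from the slide.
CORRECT = """<car>
  <make>dodge</make>
  <model>spirit</model>
  <year>1994</year>
  <owner>
    <name>you</name>
    <plate>co</plate>
  </owner>
</car>"""

# The "incorrect" document: <year> is never closed and
# </car> appears while <owner> is still open.
INCORRECT = """<car>
  <make>dodge</make>
  <model>spirit</model>
  <year>1994
  <owner>
    <plate>co</plate>
    <name>you</name>
  </car>
</owner>"""


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(CORRECT))
print(is_well_formed(INCORRECT))
```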
DTD, Schemas & Namespace
DTDs (Document Type Definitions):
- Describe the elements of XML instance documents
- Are not themselves well-formed XML
- Some data-typing
- Namespaces are harder to deal with
Schemas:
- Describe the elements of XML instance documents
- Are themselves well-formed XML
- Strong data-typing
- Namespaces are easier to deal with
Namespace:
- Collection of related element names identified by a name label (e.g. dc)

XML Namespaces and Schema
- Consistency and data quality are ensured through XML schemas and schema validation.
- Two separate XML namespaces are used:
  - one that defines the OAI-PMH response
  - another that defines the metadata records contained in the response, e.g. the record-level schema
- Example: http://www.dlese.org/oai/provider?verb=GetRecord&metadataPrefix=adn&identifier=oai:dlese.org:dlese-000-000-000-690
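To illustrate the two namespaces in one response, here is a minimal sketch that reads a hand-written GetRecord response with Python's standard library. The sample record, its identifier (oai:example.org:1) and the function name summarize are assumptions made for illustration. The namespace URIs are those of OAI-PMH 2.0, the oai_dc container, and Dublin Core.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map used only for lookups in this script.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response: the envelope is in the OAI-PMH namespace,
# the record content is in the oai_dc and dc namespaces.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2006-12-11T10:00:00Z</responseDate>
  <request verb="GetRecord" identifier="oai:example.org:1"
           metadataPrefix="oai_dc">http://example.org/oai</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2006-01-15</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample lithograph</dc:title>
          <dc:creator>Example, Artist</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>"""


def summarize(xml_bytes: bytes):
    """Return (identifier, list of dc:title values) from a GetRecord response."""
    root = ET.fromstring(xml_bytes)
    identifier = root.findtext(
        "oai:GetRecord/oai:record/oai:header/oai:identifier", namespaces=NS
    )
    titles = [el.text for el in root.iterfind(".//dc:title", NS)]
    return identifier, titles


identifier, titles = summarize(SAMPLE.encode("utf-8"))
print(identifier, titles)
```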
OAI repositories can be organized in sets
- OAI-PMH mechanism to allow for harvesting of subcollections
- Semantics for sets are defined outside of the protocol
- Sets are defined by conventions established between data and service providers, or just by the data provider
- Sets can be established that enable querying (e.g. by topic, author name, subject area, etc.)
- Example: The Open Digital Library (Suleman, 2001)

OAI repositories can be organized in sets
What do sets represent?
- Journals: issues
- Institutional repositories: departments, research centers, etc.
- Set representations may be constrained by the software package used.
- EPrint Archives: Subject, Publication Status
- Cultural Heritage Repositories: Collections with Intent 5 April, 2006
Requirements to be a Data Provider
- Source of metadata: human or automated resource catalogers
- Metadata mappings: crosswalks from native formats to DC or other formats
- Server technology: handled by the OAI software
- Datestamps: indicate when the item was last changed (handled by the OAI software)
- Deletions: indicate whether the item has been deleted and should be removed (handled by the OAI software)
- Unique identifiers: used to uniquely identify each item across repositories

Examples of repositories
- The OAForum Information Resource Database is no longer active.
- Refer to the UKOLN site: http://www.ukoln.ac.uk/repositories/digirep/index/faqs
- More repositories at: http://www.openarchives.org/register/browsesites
Examples of services
- http://oaister.umdl.umich.edu
- http://www.theeuropeanlibrary.org
- http://cicharvest.grainger.uiuc.edu/
- http://www.americansouth.org/
- http://nsdl.org/
- http://www.pictureaustralia.org/
- http://imlsdcc.grainger.uiuc.edu/
- http://www.language-archives.org/

The OAI-PMH
OAI-PMH Requests:
- Identify
- ListMetadataFormats
- ListSets
- GetRecord
- ListIdentifiers
- ListRecords
Resumption Tokens:
- Used for flow control when large responses are required
OAI-PMH: overview and structure model

Key Definitions
- Harvester: client application issuing OAI-PMH requests
- Repository: network-accessible server, able to process OAI-PMH requests correctly
- Set: optional construct for grouping items in a repository
Key Definitions
- Resource: the object the metadata is "about". The nature of resources is not defined in the OAI-PMH; resources may be digital or non-digital.
- Item: component of a repository from which metadata about a resource can be disseminated; has a unique identifier
- Record: metadata in a specific metadata format
- Identifier: unique key for an item in a repository

Protocol Details: Records
A record is the metadata of a resource in a specific format. A record has three parts: a header and metadata, both of which are mandatory, and an optional about statement. Each of these is made up of various components as set out below.
header (mandatory)
- identifier (mandatory: 1 only)
- datestamp (mandatory: 1 only)
- setSpec elements (optional: 0, 1 or more)
- status attribute for deleted item
metadata (mandatory)
- XML encoded metadata with root tag, namespace
- repositories must support Dublin Core, may support other formats
about (optional)
- rights statements
- provenance statements
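The header components listed above map directly onto a small data structure. The following is a minimal sketch, assuming a record element in the OAI-PMH 2.0 namespace; the Header class, the parse_header function and the sample identifier are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List

# Clark-notation prefix for the OAI-PMH 2.0 namespace.
OAI = "{http://www.openarchives.org/OAI/2.0/}"


@dataclass
class Header:
    identifier: str                                      # mandatory, exactly one
    datestamp: str                                       # mandatory, exactly one
    set_specs: List[str] = field(default_factory=list)   # optional, 0 or more
    deleted: bool = False                                # status="deleted" attribute


def parse_header(record: ET.Element) -> Header:
    """Build a Header from a <record> element (hypothetical helper)."""
    header = record.find(OAI + "header")
    return Header(
        identifier=header.findtext(OAI + "identifier"),
        datestamp=header.findtext(OAI + "datestamp"),
        set_specs=[s.text for s in header.findall(OAI + "setSpec")],
        deleted=header.get("status") == "deleted",
    )


# Hypothetical record for a deleted item that belongs to two sets.
RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header status="deleted">
    <identifier>oai:example.org:42</identifier>
    <datestamp>2006-11-30</datestamp>
    <setSpec>biology</setSpec>
    <setSpec>theses</setSpec>
  </header>
</record>"""

parsed = parse_header(ET.fromstring(RECORD))
print(parsed)
```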
Protocol Details: Datestamps
- A datestamp is the date of last modification of a metadata record.
- Datestamp is a mandatory characteristic of every item.
- It has two possible levels of granularity:
  - YYYY-MM-DD
  - YYYY-MM-DDThh:mm:ssZ
- The function of the datestamp is to provide information on metadata that enables selective harvesting using the from and until arguments.
- Its applications are in incremental update mechanisms.
- It gives either the date of creation, last modification, or deletion.
- Deletion is covered with three support levels: no, persistent, transient.

Protocol Details: Metadata schema
- OAI-PMH supports dissemination of multiple metadata formats from a repository.
- The properties of metadata formats are:
  - id string to specify the format (metadataPrefix)
  - metadata schema URL (XML schema to test validity)
  - XML namespace URI (global identifier for metadata format)
- Repositories must be able to disseminate unqualified Dublin Core.
- The Dublin Core Metadata Element Set contains 15 elements. All elements are optional, and all elements may be repeated.
- Further arbitrary metadata formats can be defined and transported via the OAI-PMH.
- Any returned metadata must comply with an XML namespace specification.
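The two datestamp granularities described above can be told apart with a few lines of code before a value is used in a from or until argument. This is a minimal sketch; the function name parse_datestamp and the length check used to reject loosely formatted values are assumptions made for illustration.

```python
from datetime import datetime, timezone

DAY = "YYYY-MM-DD"
SECONDS = "YYYY-MM-DDThh:mm:ssZ"

# strptime patterns paired with the granularity label they correspond to.
_FORMATS = (("%Y-%m-%dT%H:%M:%SZ", SECONDS), ("%Y-%m-%d", DAY))


def parse_datestamp(value: str):
    """Return (UTC datetime, granularity label) or raise ValueError."""
    for pattern, label in _FORMATS:
        # The label has the same length as a well-formed value, which
        # rejects variants such as "2002-1-1" that strptime would accept.
        if len(value) != len(label):
            continue
        try:
            parsed = datetime.strptime(value, pattern)
        except ValueError:
            continue
        return parsed.replace(tzinfo=timezone.utc), label
    raise ValueError(f"not an OAI-PMH datestamp: {value!r}")


print(parse_datestamp("2002-12-01"))
print(parse_datestamp("2002-12-01T13:45:00Z"))
```

A value such as 2002-12-01-13:45:00 matches neither pattern and raises ValueError, which corresponds to a request a repository would reject.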
Protocol Details: Sets
- Sets enable a logical partitioning of repositories.
- They are optional: archives do not have to define sets.
- There are no recommendations for the implementation of sets.
- Sets are not necessarily exhaustive of the content of a repository.
- They are not necessarily strictly hierarchical.
- It is important and necessary to have negotiated agreements within communities defining useful sets for the communities.
- Function: selective harvesting (set parameter)
- Applications: subject gateways, dissertation search engines, and others
- Examples:
  - publication types (thesis, article, ...)
  - document types (text, audio, image, ...)
  - content sets, according to DNB (medicine, biology, ...)

Protocol Details: Request format
- Requests must be submitted using the GET or POST methods of HTTP, and repositories must support both methods.
- At least one key=value pair must be provided: verb=RequestType (where RequestType is some type of request such as ListRecords).
- Additional key=value pairs depend on the request type.
- Example of a GET request: http://archive.org/oai?verb=ListRecords&metadataPrefix=oai_dc
- The encoding of special characters must be supported; for example, ":" (host port separator) becomes "%3A".
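As an illustration of the request format and of the special-character encoding, the sketch below assembles a GET URL with Python's standard urllib.parse.urlencode. The function name build_request_url is an assumption; the base URL and the identifier are the example values used on these slides.

```python
from urllib.parse import urlencode


def build_request_url(base_url: str, verb: str, **arguments: str) -> str:
    """Join the base URL with the verb and any further key=value pairs."""
    params = {"verb": verb, **arguments}
    # urlencode percent-encodes reserved characters, so ":" becomes "%3A".
    return f"{base_url}?{urlencode(params)}"


url = build_request_url(
    "http://archive.org/oai",
    "GetRecord",
    identifier="oai:huberlin.de:3000218",
    metadataPrefix="oai_dc",
)
print(url)
```

The same key=value pairs could be sent as the body of a POST request instead of being appended to the URL.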
Protocol Details: Response
- Responses are formatted as HTTP responses.
- The content type must be text/xml.
- HTTP-based status codes, as distinguished from OAI-PMH errors, such as 302 (redirect) and 503 (service not available) may be returned.
- Compression encodings are optional in OAI-PMH; only identity encoding is mandatory.
- The response format must be well-formed XML with markup as follows:
  1. XML declaration (<?xml version="1.0" encoding="UTF-8"?>)
  2. root element named OAI-PMH with three attributes (xmlns, xmlns:xsi, xsi:schemaLocation)
  3. three child elements:
     1. responseDate (UTC datetime)
     2. request (the request that generated this response)
     3. either a) error (in case of an error or exception condition), or b) an element with the name of the OAI-PMH request

Protocol Details: Flow control
- Four of the request types return a list of entries. Three of them may reply with 'large' lists.
- OAI-PMH supports partitioning. Those managing a repository make the decisions on partitioning: whether to partition and how.
- The response to a request includes:
  - an incomplete list
  - a resumption token
  - expiration date, size of complete list, cursor (optional)
- To continue, the harvester issues a new request with the same request type, with the resumption token as parameter and all other parameters omitted.
- The response includes the next (which may be the last) section of the list and a resumption token.
- That resumption token is empty if the last section of the list is enclosed.
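The flow-control steps above amount to a simple loop on the harvester side. The sketch below is one possible shape for it, not a reference implementation: the function harvest_identifiers, the fetch callback (which stands in for the HTTP request) and the two canned response pages are all assumptions made so that the example runs without a network connection.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterator, List

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_identifiers(fetch: Callable[[Dict[str, str]], str],
                        metadata_prefix: str) -> Iterator[str]:
    """Yield identifiers, following resumption tokens until the list ends."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(params))
        for header in root.iter(OAI + "header"):
            yield header.findtext(OAI + "identifier")
        token = root.find(f".//{OAI}resumptionToken")
        # A missing or empty token means the last section was delivered.
        if token is None or not (token.text or "").strip():
            break
        # Follow-up request: same verb, token only, all other arguments omitted.
        params = {"verb": "ListIdentifiers",
                  "resumptionToken": token.text.strip()}


def _page(identifiers: List[str], token: str) -> str:
    """Build a hypothetical ListIdentifiers response page."""
    headers = "".join(
        f"<header><identifier>{i}</identifier>"
        "<datestamp>2006-01-01</datestamp></header>"
        for i in identifiers
    )
    return ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
            f"<ListIdentifiers>{headers}"
            f"<resumptionToken>{token}</resumptionToken>"
            "</ListIdentifiers></OAI-PMH>")


PAGES = {
    None: _page(["oai:example.org:1", "oai:example.org:2"], "page-2"),
    "page-2": _page(["oai:example.org:3"], ""),
}
CALLS: List[Dict[str, str]] = []


def fake_fetch(params: Dict[str, str]) -> str:
    """Stand-in for an HTTP GET; records the arguments it was given."""
    CALLS.append(dict(params))
    return PAGES[params.get("resumptionToken")]


harvested = list(harvest_identifiers(fake_fetch, "oai_dc"))
print(harvested)
```

In a real harvester the fetch callback would perform the HTTP request and would also need to handle HTTP status codes such as 503.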
Protocol Details: Flow control (diagram)

Protocol Details: Errors and exceptions
- Repositories must indicate OAI-PMH errors by the inclusion of one or more error elements.
- The defined error identifiers are:
  - badArgument
  - badResumptionToken
  - badVerb
  - cannotDisseminateFormat
  - idDoesNotExist
  - noRecordsMatch
  - noMetadataFormats
  - noSetHierarchy
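A harvester typically checks each response for error elements before processing it. The following is a minimal sketch of such a check; the OAIError exception class, the check_response function and the sample error message text are hypothetical. It relies on the error element being a child of the OAI-PMH root and carrying its identifier in a code attribute.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


class OAIError(Exception):
    """Raised when a response contains one or more error elements."""

    def __init__(self, errors):
        self.errors = errors  # list of (code, message) tuples
        super().__init__("; ".join(f"{code}: {msg}" for code, msg in errors))


def check_response(xml_text: str) -> ET.Element:
    """Parse a response and raise OAIError if it reports protocol errors."""
    root = ET.fromstring(xml_text)
    errors = [
        (err.get("code"), (err.text or "").strip())
        for err in root.findall(OAI + "error")
    ]
    if errors:
        raise OAIError(errors)
    return root


# Hypothetical response to a request with an unknown verb.
ERROR_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2006-12-11T10:00:00Z</responseDate>
  <request>http://archive.org/oai</request>
  <error code="badVerb">Illegal OAI verb</error>
</OAI-PMH>"""

try:
    check_response(ERROR_RESPONSE)
except OAIError as exc:
    caught = exc
    print(exc)
```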
Request types
- There are six different request types:
  - Identify
  - ListMetadataFormats
  - ListSets
  - ListIdentifiers
  - ListRecords
  - GetRecord
- A harvester is not required to use all types.
- A repository must implement all types.
- There are required and optional arguments, depending on the request type.

Request types: Identify
- Function: description of an archive
- Example: archive.org/oai-script?verb=Identify
- Parameters: none
- Errors / exceptions: badArgument (e.g. archive.org/oai-script?verb=Identify&set=biology)
- Response format: described on the following slide
Request types: Identify
Response format:
Element             Example                             Ordinality
repositoryName      My Archive                          1
baseURL             http://archive.org/oai              1
protocolVersion     2.0                                 1
earliestDatestamp   1999-01-01                          1
deletedRecord       no, transient, persistent           1
granularity         YYYY-MM-DD, YYYY-MM-DDThh:mm:ssZ    1
adminEmail          oai-admin@archive.org               +
compression         deflate, compress                   *
description         oai-identifier, eprints, friends    *
Ordinality: 1 = mandatory, 1 only; + = mandatory, 1 or more; * = optional, 0 or more
Online example: http://www.dlese.org/oai/provider?verb=Identify

Request types: ListMetadataFormats
- Function: retrieve available metadata formats from an archive
- Example: archive.org/oai-script?verb=ListMetadataFormats&identifier=oai:huberlin.de:3000218
- Parameters: identifier (optional)
- Errors / exceptions:
  - badArgument
  - idDoesNotExist (e.g. archive.org/oai-script?verb=ListMetadataFormats&identifier=really-wrong-identifier)
  - noMetadataFormats
- Online examples:
  - http://www.dlese.org/oai/provider?verb=ListMetadataFormats
  - http://oai.bn.pt/servlet/oaihandler?verb=ListMetadataFormats&identifier=oai:oai.bn.pt:cienciasartes/2108
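A ListMetadataFormats response carries, for each format, the three properties listed earlier: the metadataPrefix, the schema URL and the namespace URI. The sketch below reads them into a dictionary. The function name parse_metadata_formats and the second format in the sample (example_fmt, with example.org URLs) are hypothetical; the oai_dc schema and namespace are the standard ones.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical response listing two formats.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>example_fmt</metadataPrefix>
      <schema>http://example.org/schemas/example_fmt.xsd</schema>
      <metadataNamespace>http://example.org/ns/example_fmt/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""


def parse_metadata_formats(xml_text: str) -> dict:
    """Map each metadataPrefix to its schema URL and namespace URI."""
    root = ET.fromstring(xml_text)
    formats = {}
    for fmt in root.iter(OAI + "metadataFormat"):
        prefix = fmt.findtext(OAI + "metadataPrefix")
        formats[prefix] = {
            "schema": fmt.findtext(OAI + "schema"),
            "namespace": fmt.findtext(OAI + "metadataNamespace"),
        }
    return formats


formats = parse_metadata_formats(SAMPLE)
print(sorted(formats))
```

A harvester can use the result to confirm that a repository offers the metadataPrefix it intends to request before calling ListRecords.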
Request types: ListSets
- Function: retrieve the set structure of a repository
- Example: archive.org/oai-script?verb=ListSets
- Parameters: resumptionToken (exclusive)
- Errors / exceptions:
  - badArgument
  - badResumptionToken (e.g. archive.org/oai-script?verb=ListSets&resumptionToken=any-wrong-token)
  - noSetHierarchy
- Online examples:
  - http://www.dlese.org/oai/provider?verb=ListSets
  - http://oai.bn.pt/servlet/oaihandler?verb=ListSets

Request types: ListIdentifiers
- Function: abbreviated form of ListRecords, retrieving only headers
- Example: archive.org/oai-script?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2002-12-01
- Parameters:
  - from (optional)
  - until (optional)
  - metadataPrefix (required)
  - set (optional)
  - resumptionToken (exclusive)
- Errors / exceptions:
  - badArgument (e.g. ...&from=2002-12-01-13:45:00)
  - badResumptionToken
  - cannotDisseminateFormat
  - noRecordsMatch
  - noSetHierarchy
- Online example: http://www.dlese.org/oai/provider?verb=ListIdentifiers&metadataPrefix=adn
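The parameter rules for ListIdentifiers (required metadataPrefix, optional from, until and set, exclusive resumptionToken) can be enforced on the client before a request is sent. This is a minimal sketch under that reading of the rules; the function name build_list_arguments and its keyword names are assumptions made for illustration.

```python
from typing import Dict, Optional

LIST_VERBS = {"ListIdentifiers", "ListRecords"}


def build_list_arguments(verb: str,
                         metadata_prefix: Optional[str] = None,
                         from_: Optional[str] = None,
                         until: Optional[str] = None,
                         set_spec: Optional[str] = None,
                         resumption_token: Optional[str] = None) -> Dict[str, str]:
    """Return the key=value pairs for a list request, or raise ValueError."""
    if verb not in LIST_VERBS:
        raise ValueError(f"unsupported verb: {verb}")
    selective = {"metadataPrefix": metadata_prefix, "from": from_,
                 "until": until, "set": set_spec}
    given = {key: value for key, value in selective.items() if value is not None}
    if resumption_token is not None:
        # Exclusive: the token may not be combined with any other argument.
        if given:
            raise ValueError("resumptionToken must be the only argument")
        return {"verb": verb, "resumptionToken": resumption_token}
    if "metadataPrefix" not in given:
        raise ValueError("metadataPrefix is required")
    return {"verb": verb, **given}


first = build_list_arguments("ListIdentifiers", metadata_prefix="oai_dc",
                             from_="2002-12-01")
print(first)
```

A repository would answer a request that breaks these rules with a badArgument error; checking locally only saves a round trip.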
Request types: ListRecords
- Function: harvest records from a repository
- Example: archive.org/oai-script?verb=ListRecords&metadataPrefix=oai_dc&set=biology
- Parameters:
  - from (optional)
  - until (optional)
  - metadataPrefix (required)
  - set (optional)
  - resumptionToken (exclusive)
- Errors / exceptions:
  - badArgument
  - badResumptionToken
  - cannotDisseminateFormat
  - noRecordsMatch
  - noSetHierarchy
- Online examples:
  - http://www.dlese.org/oai/provider?verb=ListRecords&metadataPrefix=oai_dc
  - http://www.dlese.org/oai/provider?verb=ListRecords&metadataPrefix=adn
  - http://oai.bn.pt/servlet/oaihandler?verb=ListRecords&from=2006-01-01&until=2006-01-30&set=bnd&metadataPrefix=tel

Request types: GetRecord
- Function: retrieve an individual metadata record from a repository
- Example: archive.org/oai-script?verb=GetRecord&identifier=oai:huberlin.de:3000218&metadataPrefix=oai_dc
- Parameters:
  - identifier (required)
  - metadataPrefix (required)
- Errors / exceptions:
  - badArgument
  - cannotDisseminateFormat
  - idDoesNotExist
- Online examples:
  - http://oai.bn.pt/servlet/oaihandler?verb=GetRecord&identifier=oai:oai.bn.pt:bnd/porbase619&metadataPrefix=tel
  - http://www.dlese.org/oai/provider?verb=GetRecord&identifier=oai%3adlese.org%3adlese-000-000-000-002&metadataPrefix=adn
Turn key systems and modules
- CWIS: http://scout.wisc.edu/projects/cwis/
- ContentDM: http://contentdm.com/
- Digitool: http://www.exlibrisgroup.com/digitool.htm
- DSpace: http://www.dspace.org/
- EPrints: http://software.eprints.org/
- DLXS: http://www.dlxs.org/
- OAICat: http://www.oclc.org/research/software/oai/cat.htm
- XMLFile: http://www.dlib.vt.edu/projects/oai/software/xmlfile/xmlfile.html
- DLESE OAI software: http://dlese.org/oai/index.jsp
- More tools at: http://www.openarchives.org/tools/tools.html

References
1. Hussein Suleman. Building Interoperable Digital Libraries: A Practical Guide to Creating Open Archives. JCDL 2001 Tutorial.
2. Hussein Suleman and Edward A. Fox. A Framework for Building Open Digital Libraries. D-Lib Magazine, December 2001. http://www.dlib.org/dlib/december01/suleman/12suleman.html
3. The Open Archives Initiative. http://www.openarchives.org
4. DLF/NSDL best practices for OAI and shareable metadata. http://oai-best.comm.nsdl.org/cgi-bin/wiki.pl?tableofcontents
5. Open Archives Forum. http://www.oaforum.org