CodeSharing: a simple API for disseminating our TEI encoding. Martin Holmes

Similar documents
Using metadata for interoperability. CS 431 February 28, 2007 Carl Lagoze Cornell University

OAI-PMH. DRTC Indian Statistical Institute Bangalore

Problem: Solution: No Library contains all the documents in the world. Networking the Libraries


IVOA Registry Interfaces Version 0.1

Metadata Harvesting Framework

Indonesian Citation Based Harvester System

Publishing Based on Data Provider

Version 2 of the OAI-PMH & some other stuff

Creating a National Federation of Archives using OAI-PMH

XSLT: where does it fit in? Martin Holmes

OAI-PMH implementation and tools guidelines

IMu OAI-PMH Web Service

The Open Archives Initiative Protocol for Metadata Harvesting: An Introduction

OAI Static Repositories (work area F)

Introduction to the OAI Protocol for Metadata Harvesting Version 2.0. Hussein Suleman Virginia Tech DLRL 17 June 2002

Chuck Cartledge, PhD. 25 February 2018

Developing an Institutional Repository Service in Chinese Academy of Sciences

Harvesting Metadata Using OAI-PMH

Open Archives Initiative protocol development and implementation at arxiv

Joining the BRICKS Network - A Piece of Cake

OAI AND AMF FOR ACADEMIC SELF-DOCUMENTATION

Flexible Design for Simple Digital Library Tools and Services

OAI-PMH repositories: Quality issues regarding metadata and protocol compliance

Hello, I m Melanie Feltner-Reichert, director of Digital Library Initiatives at the University of Tennessee. My colleague. Linda Phillips, is going

Integrating Access to Digital Content

arxiv, the OAI, and peer review

Registry Interchange Format: Collections and Services (RIF-CS) explained

Institutional Repository using DSpace. Yatrik Patel Scientist D (CS)

Presented by Sandro Zic

The Open Archives Initiative Protocol for Metadata Harvesting

DIGITAL STEWARDSHIP SUPPLEMENTARY INFORMATION FORM

ORCA-Registry v2.4.1 Documentation

A Repository of Metadata Crosswalks. Jean Godby, Devon Smith, Eric Childress, Jeffrey A. Young OCLC Online Computer Library Center Office of Research

Interoperability and Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

Open Archives Initiative Object Reuse & Exchange. Resource Map Discovery

Taking D2D Services to the Users with OpenURL, RSS, and OAI-PMH. Chuck Koscher Technology Director, CrossRef

Metadata and Encoding Standards for Digital Initiatives: An Introduction

Building Interoperable and Accessible ETD Collections: A Practical Guide to Creating Open Archives

Mobile Application for Accessing Paper Citation with Social Network Feature

Open Archives Initiative Object Reuse & Exchange. Resource Map Discovery

An Architecture to Share Metadata among Geographically Distributed Archives

The Open Archives Initiative and the Sheet Music Consortium

Orbis Cascade Alliance Content Creation & Dissemination Program Digital Collections Service. Enabling OAI & Mapping Fields in Digital Commons

Bringing Europeana and CLARIN together: Dissemination and exploitation of cultural heritage data in a research infrastructure

RVOT: A Tool For Making Collections OAI-PMH Compliant

Harvesting Statistical Metadata from an Online Repository for Data Analysis and Visualization

Applying SOAP to OAI-PMH

An introduction to OAI-PMH

Data Exchange and Conversion Utilities and Tools (DExT)

The OAIS Reference Model: current implementations

The Observation of Bahasa Indonesia Official Computer Terms Implementation in Scientific Publication

Building Interoperable Digital Libraries: A Practical Guide to creating Open Archives

MINT METADATA INTEROPERABILITY SERVICES

Adding OAI ORE Support to Repository Platforms

Repository Interoperability

Introduction

The multi-faceted use of the OAI-PMH in the LANL Repository

SobekCM METS Editor Application Guide for Version 1.0.1

30 April 2012 Comprehensive Exam #3 Jewel H. Ward

SciX Open, self organising repository for scientific information exchange. D15: Value Added Publications IST

How to Create a Custom Ingest Form

B2SAFE metadata management

NEEO TECHNICAL GUIDELINES FOR THE

Overview NSDL Collection Representation and Information Flow

Expected and Unexpected Synergies

Building for the Future

oatd.org Discovery for Open Access Theses and Dissertations An ASERL Webinar, October 15, 2013 These slides:

Outline of the course

D4.8 Report on semantic interoperability with Europeana

Network Information System. NESCent Dryad Subcontract (Year 1) Metacat OAI-PMH Project Plan 25 February Mark Servilla

GMA-PSMH: A Semantic Metadata Publish-Harvest Protocol for Dynamic Metadata Management Under Grid Environment

The OAI2LOD Server: Exposing OAI-PMH Metadata as Linked Data

OAI-PMH for dummies: how to build an institutional repository with limited resources?

The Design of a DLS for the Management of Very Large Collections of Archival Objects

Best Practice Guidelines for the Development and Evaluation of Digital Humanities Projects

Content Management for the Defense Intelligence Enterprise

Working with Islandora

Advanced Tooling in MarcEdit TERRY REESE THE OHIO STATE UNIVERSITY

Update on the TDL Metadata Working Group s activities for

Showing it all a new interface for finding all Norwegian research output

Corso di Biblioteche Digitali

Tutorial. Open Archive Initiative

Exposing and Harvesting Metadata Using the OAI Metadata Harvesting Protocol: A Tutorial

CONTENTdm & The Digital Collection Gateway New Looks for Discovery and Delivery

Nuno Freire National Library of Portugal Lisbon, Portugal

adore: a modular, standards-based Digital Object Repository

Integrating TEI and EAD

COAR Interoperability Roadmap. Uppsala, May 21, 2012 COAR General Assembly

Increasing access to OA material through metadata aggregation

A Novel Architecture of Agent based Crawling for OAI Resources

DELIVERABLE. D2.2: Modified MINT prototype. LoCloud. Local content in a Europeana cloud. Project Acronym: Grant Agreement number:

Development of an Ontology-Based Portal for Digital Archive Services

Citation Services for Institutional Repositories: Citebase Search. Tim Brody Intelligence, Agents, Multimedia Group University of Southampton

Ponds, Lakes, Ocean: Pooling Digitized Resources and DPLA. Emily Jaycox, Missouri Historical Society SLRLN Tech Expo 2018

Metadata Standards and Applications. 4. Metadata Syntaxes and Containers

Citation Services for Institutional Repositories: Citebase Search. Tim Brody Intelligence, Agents, Multimedia Group University of Southampton

Web & Android Application for Harvester of Indonesian Scientific Paper Citation

Archives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment

The adore Federation Architecture

Transcription:

CodeSharing: a simple API for disseminating our TEI encoding 1. Introduction Martin Holmes Although the TEI Guidelines are full of helpful examples, and other inititatives such as TEI By Example have made great progress in providing more access to samples of text-encoding to help beginners get started, there is no doubt that one of the biggest obstacles to encoders at many levels is finding out how other scholars and projects have chosen to encode a particular feature or use a specific tag or attribute. Burghart and Rehbein tell us that the majority of TEI users are self-taught or learned by doing, and Dee (forthcoming) reports that users need a source for a compendium of examples suitable for inductive learning. Many projects now share their XML code, but that in itself is only marginally helpful; it can take substantial time to sift through the XML code in a large project to find what you're looking for. This talk presents a simple specification for an Application Programming Interface, along with a sample implementation written in XQuery and designed for the exist XML database, providing straightforward access both for applications and end-users to sample code from any TEI project. The API is modeled on the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), a mechanism designed to allow archival search tools to ingest metadata from repositories. The CodeSharing protocol and the sample implementation were first presented at the Digital.Humanities@Oxford Summer School in July 2013, and a number of improvements 1 / 12

have been made to the code and specification based on feedback from that presentation. 2. Target audience The CodeSharing proposal arises out of two separate but intersecting needs: those of novice encoders and project managers who are not really TEI experts, and those of people doing research into encoding practices on a large scale across multiple projects. 2.1. Novice encoders At the time of writing, The Map of Early Modern London (MoEML) project, a typical grant-supported digital humanities project with a large encoding component, has a team of around seven or eight encoders working part-time, with frequent changes of personnel. The project provides regular TEI training, but there is usually a mix of experienced and rookie encoders, and project managers have to provide a lot of live help to supplement the documentation. One of the most difficult things for inexperienced encoders is to find examples of the usage of elements and attributes they haven't used before. There are of course lots of resources to help with this, including the TEI Guidelines examples, the TEI by Example project, and Marjorie Burghart's Cheatsheets, but it is unusual to find in such external resources enough examples of the use of a particular element or attribute which exactly match the current use-case. It is really more effective to search the codebase of your own project. The MoEML encoders do have access to a lot of the existing codebase, but doing text searches of this is often ineffective. They are not normally familiar with XPath, XQuery, or regular expressions, and most will never learn them, so in searching for e.g. a <birth> element which has a @notbefore-custom attribute, they will search for <birth notbefore-custom, and 2 / 12

therefore miss all the <birth> elements which happen to have their @datingmethod attribute first, or which have two spaces between the tag name and the attribute name instead of one. Searching the entire codebase may also retrieve examples of encoding in obsolete or unedited documents, which might provide bad examples. It is more effective, therefore, to enable them to search the online XML database which contains only documents which have actually been published. To serve these needs, we began to think about writing a simple search interface which would form part of our MoEML web application, and which would provide access to lots of examples of individual tags and attributes. 3 / 12

Figure 1. The Web interface of the CodeSharing service on the MoEML site. This straightforward form-based interface enables our encoders to retrieve examples of encoding quickly and easily from across our text collection. 2.2. Research into encoding practices As I worked on the interface above, I also began to think about broader possibilities. In our work on the TEI Council, we frequently find ourselves asking: Do people ever actually use this element or attribute? If so, how do they use it? 4 / 12

We typically resort to posting questions to the TEI-L mailing list to ask for examples from the community, but this is a rather slow, hit-and-miss method of gathering data. If we could retrieve large numbers of sample usages of particular elements or attributes and analyse them, this would be very helpful in development of the Guidelines, and in assessing the impact of any change we might make to the schemas or recommendations. We might also be able to supply fresh examples for the Guidelines text itself by selecting from among thousands of sample usages, rather than drawing only from our own projects or concocting examples. It would therefore be very helpful if there were an API that projects could implement to provide access to samples of their encoding practice in a standardized way, such that the same query could be run over multiple projects simultaneously and the results combined. Figure 2. A CodeSharing service can serve the needs of novice encoders as well as encoding researchers. 3. The model: OAI-PMH To answer these needs, I designed a web service that could be provided by large- and medium-scale encoding projects, enabling anyone to gather examples of their encoding practice 5 / 12

directly from their data. I modelled my protocol on an existing, well-tested system: the Open Archives Initiative Protocol for Metadata Harvesting, or OAI-PMH 1, which I had previously implemented for another project. OAI-PMH is commendably simple and well designed. A participating repository may be a data provider, which exposes structured metadata through a web service implementing the OAI-PMH API; or a service provider, which gathers that metadata through requests to the data providers. The service providers can then act as meta-repositories, or federated archives, providing search functionality which encompasses the collections of all the data providers who have exposed their metadata for harvesting. An example is OCLC's WorldCat 2, which aggregates data from a large number of repositories and makes them searchable from a single interface. The OAI-PMH API is based on HTTP requests using GET or POST. It is designed to allow a harvester to find out what kinds of resources a repository has, and to gather full metadata records. All responses are in XML, and conform to a standard schema. OAI-PMH is based around six core verbs: Identify ListMetadataFormats ListIdentifiers ListSets ListRecords GetRecord A request like this 3 : http://bcgenesis.uvic.ca/oai.xq?verb=identify will result in a response in 1. http://www.openarchives.org/oai/2.0/openarchivesprotocol.htm 2. http://www.worldcat.org/ 3. This request is made to the OAI-PMH interface provided by the Colonial Correspondence of BC and Vancouver Island project, at http://bcgenesis.uvic.ca. 6 / 12

XML which provides identifying information about the repository: <OAI-PMH xsi:schemalocation="http://www.openarchives.org/oai/2.0/ http://www.openarchives.org/oai/2.0/oai-pmh.xsd"> <responsedate>2013-05-08t18:09:31.047z</responsedate> <request verb="identify">http://bcgenesis.uvic.ca/oai.xq</request> <Identify> <repositoryname>the colonial despatches of Vancouver Island and British Columbia 1846-1871</repositoryName> <baseurl>http://bcgenesis.uvic.ca/oai.xq</baseurl> <protocolversion>2.0</protocolversion> <adminemail>mholmes@uvic.ca</adminemail> <adminemail>cpetter@uvic.ca</adminemail> <earliestdatestamp>2012-11-19t12:00:00z</earliestdatestamp> <deletedrecord>no</deletedrecord> <granularity>yyyy-mm-dd</granularity> <description> <oai-identifier xsi:schemalocation="http://www.openarchives.org/oai/2.0/oai-identifier http://www.openarchives.org/oai/2.0/oai-identifier.xsd"> <scheme>oai</scheme> <repositoryidentifier>bcgenesis.uvic.ca</repositoryidentifier> <delimiter>:</delimiter> <sampleidentifier>oai:bcgenesis.uvic.ca:b63030sp.scx</sampleidentifier> </oai-identifier> </description> </Identify> </OAI-PMH> The harvester can use the ListRecords verb to retrieve individual records, which are typically in the form of Dublin Core elements embedded in a larger structure in the OAI namespace: <record> <header> <identifier>oai:bcgenesis.uvic.ca:b585te13.scx</identifier> <datestamp>2013-12-17t14:29:44.553-08:00</datestamp> <setspec>1858</setspec> <setspec>publicoffices</setspec> <setspec>b.c.</setspec> 7 / 12

</header> <metadata> <oai_dc:dc xsi:schemalocation="http://www.openarchives.org/oai/2.0/oai_dc/ http://www.openarchives.org/oai/2.0/oai_dc.xsd"> <dc:title>the colonial despatches of Vancouver Island and British Columbia 1846-1871: 11566, CO 60/2, p. 291; received 13 November. Trevelyan to Merivale (Permanent Under-Secretary)</dc:title> <dc:date>1858-11-12</dc:date> [...] </oai_dc:dc> </metadata> </record> Most repositories will have thousands of records, and retrieving them all at once would place an unacceptable burden on the server and network infrastructure, so OAI-PMH has a built-in staging system. Records are supplied in batches, and each batch ends with a resumption token which acts as a flow-control device: <resumptiontoken completelistsize="7154"> from:2010-01-01;until:2020-01-01;set:;next:21 </resumptiontoken> The format of the resumption token is not defined by the specification; the data provider may use any suitable format, and the harvester simply has to echo it back to the data provider to get the next set of records. This gives the data provider complete control over how rapidly it provides the data, and even in the course of a large transfer, the provider can throttle or accelerate its provision of records in response to local conditions such as server load or network bandwidth. 4. The CodeSharing service The complete specification for the CodeSharing API is available at http://mapoflondon.uvic.ca/codesharing_protocol.xhtml, as part of the sample implementation on 8 / 12

the MoEML site; in what follows I cover only some key aspects of it. 4.1. The request API CodeSharing is an XML-based API provided over HTTP, just like OAI-PMH. On the model of OAI-PMH, it's also based on a verb parameter, with five possible values: identify listelements listattributes listnamespaces getexamples The first four are designed to discover the nature of the repository, and what elements, attributes and namespaces occur in its document collections. The final value is a request for actual example encodings; when that value is used, other key-value pairs provide details about what has been requested: elementname attributename attributevalue Each parameter is used to specify what examples are being requested. These can obviously be combined, so a request for: elementname=hi&attributename=rend will retrieve only hi elements which have @rend attributes; attributename=rend&value=italic will retrieve any elements which have @rend="italic". Two further parameters are available: namespace (the namespace for requested elements, defaulting to the TEI namespace) wrapped (whether or not to return the parent containing the target element) 9 / 12

So a request for: elementname=hi&wrapped=true will retrieve <hi> elements in the context of their parent element. Encoders find it helpful to see not just the target element, but the surrounding context too. Finally, we have to consider flow control, as in the case of OAI-PMH. It would be disastrous to attempt to honour a request for all of the <TEI> elements in a large collection; we need to negotiate a reasonable chunk size for the harvester and the server. In the case of OAI-PMH, the data provider always dictates the number of records it is prepared to supply. In the case of CodeSharing, I wanted to allow a little more flexibility, so there is a further parameter the harvester can provide: maxitemsperpage (a positive integer) The provider service is not required to honour this request; it may decide to send fewer items. But it may be sensitive enough to know, for instance, that when the request is for a relatively small element such as <hi>, it might be practical to send 100 examples in one response, whereas it might send only 20 in the case of requests for <p> or <div> elements. 4.2. The response What form should the server's response take? The obvious answer is that it should be XML, and in fact that it should be TEI P5 XML. The exact format of the response document is only loosely specified, although some parts of it must follow certain rules. If the value of the verb parameter is listelements, for instance, then the body of the document must contain the list of all elements appearing in the collection as a list: <list rend="bulleted"> <item> 10 / 12

<gi>author</gi> </item> <item> <gi>availability</gi> </item> <item> <gi>back</gi> </item> [...] </list> Similar structures are used to list attributes and namespaces. For returning actual examples, CodeSharing makes use of the <egxml> element which is also used for example code in the TEI Guidelines. The <egxml> element is in its own special namespace, http://www.tei-c.org/ns/examples, and all the elements which are children of it, in the example code, are also by default in that namespace. This is useful, because it means that we can easily distinguish example code from other parts of the TEI file. (It also means we can use the API to retrieve examples of code which themselves are intended as examples in their original context.) In addition to the results of the query, the protocol specification also requires that the parameters of the original request be returned to the requestor; this means that the result document is a complete and self-contained record of the query and results. Full details are available in the protocol documentation. 5. A sample implementation A sample implementation of the CodeSharing protocol, including an HTML front-end, as shown in Figure 1, is available at http://mapoflondon.uvic.ca/codesharing.htm. It is written in XQuery 11 / 12

3.0 and runs in the exist XML database which hosts the MoEML web application. The open-source code is available on SourceForge at https://sourceforge.net/projects/codesharing/, and includes these files: codesharing.xql (the XQuery implementation providing responses to queries in XML) codesharing_config.xql (a simple settings file that tailors the service to your own project) codesharing.xsl (a transformation which produces the HTML search page you see on the MoEML site) codesharing_protocol.xhtml (a semi-formal description of the API) codesharing.odd (an ODD file from which a schema can be generated to validate CodeSharing API responses) I would welcome any contributions in the form of improvements, suggestions, and implementations in other languages. Bibliography Burghart, Marjorie, and Malte Rehbein. 2012. The Present and Future of the TEI Community for Manuscript Encoding. Journal of the Text Encoding Initiative 2. http://jtei.revues.org/372. doi:10.4000/jtei.372. Dee, Stella. Forthcoming. Learning the TEI in a Digital Environment. Journal of the Text Encoding Initiative 7. Open Archives Protocol for Metadata Harvesting (OAI-PMH) Version 2.0. 2008. http://www.openarchives.org/oai/2.0/openarchivesprotocol.htm. Van den Branden, Ron, Melissa Terras, and Edward Vanhoutte. 2010. TEI by Example. http://www.teibyexample.org/. 12 / 12