Neil Jefferies, Tanya Gray Jones, Bodleian Libraries

Session Structure
  Metadata and Data
  Modelling using the Prov Ontology

Objects
Common objects reappear in many places:
  Items: Works, (Manifestations), Artefacts, Components
  Labels: Classifications, Vocabularies, Ontologies, Names, Attribute values
    Sort and group items
    These are vital for discovery (not everything is full-text indexable)
  Context: Places, People, Geopolitical entities, Collections
    Locate items
It is *possible* for something to be more than one type of object
  Fictitious creations, automata
Objects have
  Attributes: literal (properties)
  Relationships to other objects
  Internal structure

Important Considerations
The Model should fit the Knowledge
  If you are working hard to make your information fit then you are using the wrong approach
  Don't sacrifice accuracy for conformance
  Standards have implicit biases and assumptions
  Affects the types of question that can be asked or answered
Efficiency matters!
Preservation
Economics of re-use
File format choice
Significant properties
Metadata is critical*
Re-use
  Final format vs continued use
  Cannot anticipate how
  Most potential users not born

No need for a single approach
Standards suffer from scope creep
  Handle their initial design targets well and everything else rather less so
Sooner or later your information will become graph-like
  Types, relationships, unlike an RDBMS
RDF (like many standards) can technically encode almost anything, but
  Different knowledge types are best treated differently
  Mashing it all together is confusing and reduces reusability
  It is also inefficient
  There are existing standards (W3C/IETF > DH > Library)
[Diagram: Author, Book and Digitised Images, described using vcard, MODS (Bibliographic), EXIF (photographic), PREMIS (Preservation), ALTO (text coordinates), CC-BY-SA (Rights), Text (OCR Output), Text (Abstract), JPEG (Image), TIFF (Image)]

Data and Metadata
  Questions?
  Context
  Provenance
  Evidence
  Qualification

A False Dichotomy (partly)
An artefact derives much of its meaning from attributes that are not intrinsic to the artefact itself
  Context - the circumstances under which it was created
  Provenance - the route by which it came to be where it is now
This is especially true for digital materials
  A file is a meaningless stream of bytes
  The name can be readily changed: it is not intrinsic
  The file format is not intrinsic: text vs XML vs HTML vs TeX
  The metadata alone can have more meaning than the data alone
Can we even unambiguously define metadata?
  Image vs transcription vs abstract vs description
A digital object should be considered a greater whole comprising several streams of information that can be arbitrarily labelled data or metadata but all of which contribute to the intellectual content of the object

Original Context
The original context in which the artefact was created
  Current context is the product of provenance
Who created it?
  Author, illustrator, scribe, typesetter, printer, publisher?
Why did they create it?
How did they create it?
Where and when did they create it?
What was going on when they created it?
[Diagram: Context and Artefact, linked by "Provides evidence for" and "Gives meaning to"]

Context is shared
  The Paradise of Dainty Devises???
  Chemistry of Insulin: determination of the structure of insulin opens the way to greater understanding of life processes
  Nucleotide sequence of 5S-ribosomal RNA from Escherichia coli
  The Hitchhiker's Guide to the Galaxy
  The Restaurant at the End of the Universe
  So Long, and Thanks for All the Fish

Provenance
How did an artefact come to be where and how it is now?
How was a digital surrogate created/curated etc.?
Digital and physical in parallel
  Conservation and preservation apply to both
The basic questions are framed in similar terms to original context but with an emphasis on Time and Process
  The original context is just the early part of provenance!
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister...
<HTML><HEAD><Title>Alice's Adventures in Wonderland - - Chapter I</Title></HEAD><BODY>

Provenance/Context Models
Key components:
  Objects (entities, "things")
  Events located in space and time
  Agents: Create/change other entities/relationships
  Items: artefacts, people, places
  Labels: Classifications, ontologies, ideas, bibliographic works
  Groups: Organisations, collections (geopolitical constructs)
  Relationships (typed)
[Diagram: Agent "Participates In" Event, "Which changes" Item]
CIDOC-CRM www.cidoc-crm.org
  ISO 21127
  UNESCO/International Council of Museums
Schema.org www.schema.org (roles and events recently added!)
  Google, Bing!, Yahoo etc.
TEI www.tei-c.org
PREMIS www.loc.gov/standards/premis
  Preservation metadata originally (3.X is a significant revision)

Evidence
Data models are about assertions, *NOT* truth or reality!
Provenance of assertions about objects matters: this is a key mechanism of scholarship
  Who made the assertion?
  When?
  On what basis?
Assertions may be multiple or contradictory
Some use cases attempt to compute confidence or probability values (!)
In practice
  This can be and is ignored for some cases (intrinsic properties of an object)
  This is often the starting point for further research (library catalogue, pre-existing data)

Expressing Evidence
Most evidence can be accommodated by adopting an event-oriented expression of information
The mechanism used for expressing context and provenance also works here
[Diagram: a flat Manuscript record (Author, AuthorBirthDate, PlaceofCreation, DateofCreation, EvidenceForAuthorPlaceDateOfCreation, CurrentLocation, DateOfDepositAtCurrentLocation, EvidenceForDateOfDepositAtCurrentLocation, Title, Abstract) compared with an event-oriented form: Author (BirthDate, BirthPlace), Manuscript (Title, Abstract), Creation Event (Time, Place, Evidence), Deposit Event (Time, Place, Evidence)]
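The event-oriented form can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a schema from the talk: the class names `Event` and `Item`, the helper `events_of`, and every value in the example record are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    """Something that happened to an item, located in time and place."""
    kind: str                                  # e.g. "creation", "deposit"
    time: Optional[str] = None                 # free text here; a real model would use a date type
    place: Optional[str] = None
    evidence: List[str] = field(default_factory=list)  # why we believe this event happened


@dataclass
class Item:
    """The item keeps only its own attributes; history lives in events."""
    title: str
    abstract: str = ""
    events: List[Event] = field(default_factory=list)

    def events_of(self, kind: str) -> List[Event]:
        return [e for e in self.events if e.kind == kind]


# Hypothetical manuscript: all values below are made up for illustration.
manuscript = Item(
    title="Example manuscript",
    events=[
        Event("creation", time="c. 1450", place="Example Town",
              evidence=["colophon"]),
        Event("deposit", time="1602", place="Example Library",
              evidence=["accession register"]),
    ],
)

deposit = manuscript.events_of("deposit")[0]
print(deposit.place, deposit.evidence)
```

A new kind of event (a sale, a conservation treatment) needs no new fields on the item, and each event carries its own evidence, so nothing like EvidenceForDateOfDepositAtCurrentLocation has to be invented per attribute.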

Another Viewpoint
We can reframe the previous discussion in terms of a general need to be able to qualify an assertion in terms of one or more of:
  Time - When an assertion is true
    An obvious case, the existence of a person
  Place - Where an assertion is true
    Professor of History at Oxford <> Professor of History at Heidelberg
    Places can be geopolitical entities such as jurisdictions
    Which are themselves time dependent
  Source - Who made the assertion
    An anonymous text is a valid source though
  Evidence - Why the assertion has been made (and counter-evidence too)
  Confidence - How much can the assertion be trusted
    Often depends on the source and evidence
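One way to picture these qualifiers is as optional fields attached to each assertion. The sketch below is a hedged illustration only: the `Assertion` class, its field names, the helper `places_in`, and the people, dates, sources and confidence values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Assertion:
    subject: str
    predicate: str
    obj: str
    valid_years: Optional[Tuple[int, int]] = None  # Time: (start, end), inclusive
    place: Optional[str] = None                    # Place: where it is true
    source: Optional[str] = None                   # Source: who asserted it
    evidence: Tuple[str, ...] = ()                 # Evidence: why it was asserted
    confidence: Optional[float] = None             # Confidence: None means not assessed

    def holds_in(self, year: int) -> bool:
        """An assertion with no time qualifier is treated as always applicable."""
        if self.valid_years is None:
            return True
        start, end = self.valid_years
        return start <= year <= end


# Two assertions about the same (made-up) person, qualified differently.
assertions = [
    Assertion("person:X", "heldPost", "Professor of History",
              valid_years=(1850, 1860), place="Oxford",
              source="Source A", confidence=0.9),
    Assertion("person:X", "heldPost", "Professor of History",
              valid_years=(1861, 1870), place="Heidelberg",
              source="Source B", confidence=0.6),
]


def places_in(year: int) -> List[str]:
    """Where was the post held, according to assertions valid in that year?"""
    return [a.place for a in assertions if a.holds_in(year)]


print(places_in(1855))
print(places_in(1865))
```

Both assertions share subject, predicate and object; only the qualifiers distinguish them, which is why an unqualified statement of the same fact would be ambiguous.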

Different Knowledge Types
[Diagram axis labels: Increasing Uncertainty; Need for Qualification]
Derived Knowledge
  History, Meaning
  Relate Semantic Elements to other objects
  Who, When, Where
  Iconography
Context
  Immediate information available from the object environment
  Metadata!
  Creator, Location in Library, Accession
  Documented Provenance
Semantic Elements
  Meaningful chunks of content
  Titles/Subtitles, Personal Names, Place Names, Contents Lists, Indices, Dates
  Image Components
Intrinsic Information
  Raw information content
  Raw Text, Lines, Headers, Pagination, Images
  Text Coordinates
Physical Attributes
  Material, Page Size, Font, Colour

Modelling using the Prov Ontology
  Questions
  The Semantic Web and RDF
  The Prov Ontology
  Examples

The Semantic Web
Tim Berners-Lee (1998), Semantic Web Roadmap. http://www.w3.org/designissues/semantic.html
Key components
  URI (Uniform Resource Identifier): to indicate where things can be found online
  Unicode (multilingual at the outset)
  RDF (Resource Description Framework): the Semantic way of expressing information as triples
  XML (Extensible Markup Language): one way of encoding RDF information
    Other formats such as JSON-LD are used
  RDF-S: used for expressing RDF schemas (in RDF)
  OWL (Web Ontology Language): general mechanism for expressing ontologies/vocabularies/schemas
    Superset of RDF-S and a lot more complex (also OWL-Lite)
  RIF (Rule Interchange Format): intended to define rules for processing RDF, actually maps between many existing rule formats
  SPARQL (SPARQL Protocol and RDF Query Language): query language for RDF, usually run against a triple store
  Crypto: encryption and signing technologies to ensure data can be transmitted securely
Phew!
  Fortunately, it is possible to generate RDF without knowing about much of this!
  If you need to, there are tools available!

Linked Open Data
Semantic Web is necessary but not sufficient
Tim Berners-Lee (2006), Linked Data. http://www.w3.org/designissues/linkeddata.html
Four rules:
  1. Identify everything with URIs (avoid literals if possible)
  2. Use Web URIs, i.e. URLs
  3. Return meaningful semantic information when a URI is requested: this could be simple RDF or a SPARQL endpoint
  4. Make links
2010 addendum - Five Stars for Linked Open Data
  1. Available on the web (whatever format) but with an open licence
  2. Available as machine-readable structured data (text rather than scan)
  3. as (2) plus: Non-proprietary format (e.g. CSV instead of Excel)
  4. as (3) plus: Use RDF and SPARQL, so that people can point at your stuff
  5. as (4) plus: Link your data to other people's data to provide context
URIs are now IRIs (Internationalised Resource Identifiers)
  With added Unicode support

Basic RDF Construct
RDF 1.1 Concepts and Abstract Syntax http://www.w3.org/tr/2014/rec-rdf11-concepts-20140225/
RDF Triple
  A triple has a subject, a predicate and an object
  The subject is an IRI or a blank node, the predicate is an IRI, and the object is an IRI, a literal or a blank node
  A literal has a data type and may have a controlled vocabulary (defined by RDF-S, OWL etc.)
  An IRI points to a resource that returns more RDF that gives more detail about the part in question
  Links to non-RDF data (e.g. images) are possible and necessary
RDF Graph
  Collection of triples (which are related in some way)
RDF Dataset
  Collection of graphs (one is default, others are named for convenience)
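The triple, graph and dataset layers can be mimicked with plain Python containers. This is a minimal sketch rather than a real RDF library: the helper names `IRI`, `Literal`, `Triple` and `serialise` are hypothetical, the `example.org` IRIs are placeholders, and blank nodes and language tags are left out for brevity. The output follows the N-Triples line format (`<subject> <predicate> object .`).

```python
from typing import Dict, NamedTuple, Optional, Set, Union


class IRI(NamedTuple):
    value: str


class Literal(NamedTuple):
    value: str
    datatype: str = "http://www.w3.org/2001/XMLSchema#string"


class Triple(NamedTuple):
    subject: IRI
    predicate: IRI
    obj: Union[IRI, Literal]


def term(t: Union[IRI, Literal]) -> str:
    """Render one term in N-Triples style."""
    if isinstance(t, IRI):
        return f"<{t.value}>"
    escaped = t.value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"^^<{t.datatype}>'


def serialise(t: Triple) -> str:
    return f"{term(t.subject)} {term(t.predicate)} {term(t.obj)} ."


book = IRI("http://example.org/book/1")        # placeholder IRI
author = IRI("http://example.org/person/1")    # placeholder IRI

title_triple = Triple(book, IRI("http://purl.org/dc/terms/title"),
                      Literal("Example title"))
creator_triple = Triple(book, IRI("http://purl.org/dc/terms/creator"), author)

# A graph is a set of triples; a dataset maps graph names to graphs,
# with None standing in for the default graph.
Graph = Set[Triple]
default_graph: Graph = {title_triple}
catalogue_graph: Graph = {creator_triple}
dataset: Dict[Optional[str], Graph] = {
    None: default_graph,
    "http://example.org/graph/catalogue": catalogue_graph,
}

for name, graph in dataset.items():
    for line in sorted(serialise(t) for t in graph):
        print(name or "(default)", line)
```

Note that the object of the second triple is an IRI rather than a literal, so the author can in turn be the subject of further triples.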

The PROV Ontology
W3C standard http://www.w3.org/tr/prov-o

Relationship in Context
The basic PROV-O relationships are rather generic so they need to be qualified
Roles define how an entity relates to an activity
Entity includes agents
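A small sketch of what qualification looks like in practice, written as plain Python tuples of prefixed names rather than with an RDF library. The `prov:` terms used here (`prov:wasAssociatedWith`, `prov:qualifiedAssociation`, `prov:Association`, `prov:agent`, `prov:hadRole`, `prov:Role`) are PROV-O vocabulary; everything under the `ex:` prefix and the helpers `objects` and `roles_of` are hypothetical names chosen for this example.

```python
# Each triple is (subject, predicate, object) using prefixed names.
triples = {
    ("ex:digitisation1", "rdf:type", "prov:Activity"),
    ("ex:scan1", "rdf:type", "prov:Entity"),
    ("ex:operator1", "rdf:type", "prov:Agent"),
    ("ex:scan1", "prov:wasGeneratedBy", "ex:digitisation1"),

    # Unqualified: says only that the agent was involved somehow.
    ("ex:digitisation1", "prov:wasAssociatedWith", "ex:operator1"),

    # Qualified: the association becomes a node that can carry a role.
    ("ex:digitisation1", "prov:qualifiedAssociation", "ex:assoc1"),
    ("ex:assoc1", "rdf:type", "prov:Association"),
    ("ex:assoc1", "prov:agent", "ex:operator1"),
    ("ex:assoc1", "prov:hadRole", "ex:scannerOperator"),
    ("ex:scannerOperator", "rdf:type", "prov:Role"),
}


def objects(subject: str, predicate: str) -> set:
    """All objects of triples matching the given subject and predicate."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}


def roles_of(agent: str, activity: str) -> set:
    """Roles the agent played in the activity, via qualified associations."""
    roles = set()
    for assoc in objects(activity, "prov:qualifiedAssociation"):
        if agent in objects(assoc, "prov:agent"):
            roles |= objects(assoc, "prov:hadRole")
    return roles


print(roles_of("ex:operator1", "ex:digitisation1"))
```

The unqualified triple is kept alongside the qualified one, so a consumer that ignores roles can still follow the simple relationship.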

Start Modelling
I have not discussed how data is actually captured and stored: this is intentional, and it should not be considered until you
  Understand what information you have
  Understand what questions you want to answer
  Understand what tools you have available
  Understand what additional information you need to acquire
The data modelling process will help with some of these (to some extent, no promises)
  (It's only a model)
RDF can be represented in many different ways
  http://camelot-dev.bodleian.ox.ac.uk/
Non-RDF data
  Where possible, consider expressing your outputs in a similar manner: this will enrich the basic dataset and allow further development

Questions

One more thing
The Rules bit
  Can define inference rules for machine reasoning
    If (A is-the-son-of B) and (B is-the-son-of C) then (A is-the-grandson-of C)
  Simplifies data entry
  Enriches datasets
  And more
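The grandson rule can be applied mechanically to a set of triples. The sketch below hard-codes that single rule in Python purely for illustration; the function name `apply_grandson_rule` is hypothetical, and a real system would more likely express the rule in a rule language and hand it to a reasoner.

```python
def apply_grandson_rule(facts: set) -> set:
    """Return the facts plus every grandson triple the rule can derive.

    Rule: (A is-the-son-of B) and (B is-the-son-of C)
          => (A is-the-grandson-of C)
    """
    sons = {(a, b) for (a, p, b) in facts if p == "is-the-son-of"}
    inferred = {
        (a, "is-the-grandson-of", c)
        for (a, b) in sons
        for (b2, c) in sons
        if b == b2
    }
    return facts | inferred


# Only the two son-of facts are entered by hand.
facts = {
    ("A", "is-the-son-of", "B"),
    ("B", "is-the-son-of", "C"),
}

enriched = apply_grandson_rule(facts)
print(sorted(enriched - facts))
```

Only the son-of facts need to be entered; the grandson relationship is derived, which is the sense in which rules simplify data entry and enrich datasets.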