Best Practices Exchange 2013
It's All About The Metadata
Mark Evans - Digital Archiving Practice Manager
11/13/2013
Agenda
- Why metadata is important
- Metadata landscape
- A flexible approach
- Case study - KDLA
- Conclusions
- Demonstration using Preservica
Why is Metadata Important?
10110100100010101010000111110010010100100100010
010101010000111100101001011011100000111101110110
0101010000011111001111110011111100111100011000100
A binary file is meaningless on its own
The Big Question
How much metadata is necessary to preserve digital objects?
We Have Some Guidance
Open Archival Information System (OAIS) Reference Model
- ISO standard
- Well adopted
- Explains what, not how
- Describes functions and an information model
Concept of Information Packages
- Three types: Submission (SIP), Archival (AIP), Dissemination (DIP)
- Contain aggregates of information objects
Data Object (10010 11010 01110 01110), interpreted using its Representation Information, yields an Information Object
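The step from Data Object to Information Object can be sketched in a few lines of Python. This is only an illustration of the idea, not something defined by the OAIS standard: the class `RepresentationInformation`, the function `interpret`, and the sample bit string are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class RepresentationInformation:
    """Hypothetical, minimal stand-in: records how the bits should be read."""
    description: str
    encoding: str


def interpret(data_object: bytes, rep_info: RepresentationInformation) -> str:
    """Data Object + Representation Information -> Information Object."""
    return data_object.decode(rep_info.encoding)


# Sample bit string chosen for this example (not taken from the slide).
bits = "01001111 01000001 01001001 01010011"
data_object = bytes(int(group, 2) for group in bits.split())

rep_info = RepresentationInformation("Plain text", "ascii")
information_object = interpret(data_object, rep_info)
print(information_object)  # prints OAIS
```

Without `rep_info` the four bytes are just numbers; with it, they become readable text.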
OAIS Information Package
An Information Package can contain 4 types of Information Object:
- Descriptive Information
- Content Information
- Packaging Information
- Preservation Description Information: Provenance, Context, Reference, Fixity, Rights
Each Information Object has associated Representation Information
Example: Corresponding metadata
Descriptive Metadata
  ID: HF2653-001-abc
  Author: Loeffler, Dean and Lamning
  Summary: A bill for an act relating to improvements to the capital area.
Descriptive Metadata
  Collection: HF2653-001-abc
  Representation: file1.jpg, file2.jpg
  Parent record: AAABBB123
Technical Metadata
  HF2653.pdf, 25654 bytes, created 10/5/2011, valid and well formed
  SHA1=2323A563DF4329
Application Information
Data Format: PDF v1.4 Portable Document Format [fmt/41]
Binary Sequence: 11111111 11011000 11111111 11100000 00000000 00010000 01001010 01000110 01001001 01000110
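Technical metadata of this kind (size, checksum, format) can be derived from the bytes themselves. The sketch below is a minimal illustration, not how any particular preservation system does it: the function `technical_metadata` and the two-entry `SIGNATURES` table are hypothetical, and real format identification uses a full signature registry. As a worked input it uses the ten-byte binary sequence from the example, which decodes to FF D8 FF E0 00 10 followed by the ASCII letters "JFIF".

```python
import hashlib

# Hypothetical two-entry signature table; real format identification
# relies on a full registry rather than a hand-written list.
SIGNATURES = {
    b"\xff\xd8\xff": "JPEG",
    b"%PDF-": "PDF",
}


def technical_metadata(name: str, data: bytes) -> dict:
    """Collect size, a SHA-1 fixity value, and a guessed format."""
    fmt = "unknown"
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            fmt = label
            break
    return {
        "name": name,
        "size_bytes": len(data),
        "sha1": hashlib.sha1(data).hexdigest().upper(),
        "format": fmt,
    }


# The ten-byte binary sequence from the example above.
slide_bits = (
    "11111111 11011000 11111111 11100000 00000000 "
    "00010000 01001010 01000110 01001001 01000110"
)
slide_bytes = bytes(int(group, 2) for group in slide_bits.split())

record = technical_metadata("file1.jpg", slide_bytes)
print(slide_bytes.hex(" "))  # ff d8 ff e0 00 10 4a 46 49 46
print(record["format"])      # JPEG
```

Recomputing the checksum later and comparing it with the stored value is what makes the fixity information useful.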
Metadata Landscape
What metadata does a digital preservation system need?
- Understand structural information:
  - Hierarchy of records
  - Relationships between records
  - Hierarchy of files
  - Relationships between records and files
- Understand technical information:
  - Technology-dependent information: determine if actions are needed (e.g., obsolete format)
  - Technology-independent information: verify preservation actions
Metadata Landscape II
Standards, Standards, Standards!!!
Dublin Core, PREMIS, METS, EAD, MODS, PBCore, MIX, FGDC, etc.
Metadata Landscape III
Let's not forget about descriptive metadata
Needs to support:
- Holding metadata with the appropriate entity in the hierarchy
- Allowing users to view metadata
- Allowing users to add / edit metadata
- Allowing users to search on metadata
- Still converting it if needed (e.g., for export)
Metadata Sources
Have to deal with lots of ingest sources
- Each ingest source potentially contains metadata
- Unlikely to be a consistent scheme across sources
  - Could be content specific
  - Could be standards based
  - Could be custom
Traditional approach is to create archival metadata
- Can be manually intensive
- Potential for lots of repetition
- Realistic only at high levels?
- Can delay accessioning
Which solutions support it?
Metadata Sources
Could convert existing metadata to a normalized form:
  Source 1, Source 2, Source 3 -> Convert -> Schema A -> OAIS Digital Archive
Or force a standard on the creators
Metadata Sources
This would reduce the problem
However, the combined schema may change over time:
- Potential for subsequent conversions
- Or cope with multiple versions
- May require software changes
Also, each conversion is a potential point of loss
- Adopted archival schema may not provide full coverage for the source
- E.g., EAD arrives and DC is the adopted schema
Metadata Sources
Desirable to reuse existing metadata:
  Source 1 (Schema 1), Source 2 (Schema 2), Source 3 (Schema 3) -> OAIS Digital Archive
Can we cope with heterogeneous schemas?
- Need to examine types of metadata
A Flexible Approach
Fixed schema + embedding
Define a schema that:
- Understands structural information
- Understands technical information, e.g., PREMIS
- Embeds any descriptive metadata
- Embeds any additional technical metadata, e.g., MIX
- Can embed multiple metadata schemas for each entity
Schema supports standard OAIS functions:
- Ingest, Access, Data Management, Storage
- Controls preservation
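The "fixed schema + embedding" idea can be sketched as a fixed envelope that carries source metadata fragments unchanged. The code below is a minimal illustration using Python's standard `xml.etree.ElementTree`. The element and attribute names (`Entity`, `Metadata`, `schemaURI`), the `urn:example:accession` identifier, and the function `build_entity` are hypothetical and are not the actual archive schema.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"


def build_entity(ref: str, fragments: list) -> ET.Element:
    """Wrap each source metadata fragment, untouched, in a fixed envelope.

    Element and attribute names here are hypothetical.
    """
    entity = ET.Element("Entity", {"ref": ref})
    for schema_uri, xml_text in fragments:
        holder = ET.SubElement(entity, "Metadata", {"schemaURI": schema_uri})
        # Parsing also rejects fragments that are not well-formed XML.
        holder.append(ET.fromstring(xml_text))
    return entity


# Two fragments in different schemas, attached to the same entity.
dublin_core = (
    f'<record xmlns:dc="{DC_NS}">'
    "<dc:identifier>HF2653-001-abc</dc:identifier>"
    "<dc:description>A bill for an act relating to improvements "
    "to the capital area.</dc:description>"
    "</record>"
)
accession = "<accession><parent>AAABBB123</parent></accession>"

entity = build_entity(
    "HF2653-001-abc",
    [(DC_NS, dublin_core), ("urn:example:accession", accession)],
)
print(ET.tostring(entity, encoding="unicode"))
```

The envelope only needs to know which entity a fragment belongs to and which schema it claims to follow; the content of each fragment stays in its native form.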
Registering Schema Any metadata schema can be registered with SDB
Meeting Ingest Needs
Users can embed / validate XML using the SIP Creator tool
- Either from a file
- Or by cutting and pasting into the tool
Ingest workflow steps can be written / modified to embed metadata in XIP to support automated ingest
- Source metadata could be from files or other systems
Meeting Descriptive Needs
Descriptive functions can use XSLT:
- View - Transform XML to static HTML
- Edit - Transform XML to dynamic HTML
- Transform - Transform XML to alternative XML schema
Search:
- Use SOLR
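To make the "view" case concrete, the sketch below turns an embedded XML record into a static HTML table. It is a stand-in written with Python's standard library rather than XSLT (the standard library has no XSLT processor), so it illustrates the shape of a view transform, not the transforms actually used. The function names and the sample record are made up for this example.

```python
import html
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def view_transform(xml_text: str) -> str:
    """Render the child elements of a metadata record as a static HTML table."""
    root = ET.fromstring(xml_text)
    rows = [
        f"<tr><th>{html.escape(local_name(child.tag))}</th>"
        f"<td>{html.escape(child.text or '')}</td></tr>"
        for child in root
    ]
    return "<table>" + "".join(rows) + "</table>"


# Made-up Dublin Core style record used only for this illustration.
record = (
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Minutes &amp; agendas</dc:title>"
    "<dc:date>2011-10-05</dc:date>"
    "</record>"
)
page = view_transform(record)
print(page)
```

An edit transform would follow the same pattern but emit form inputs instead of table cells.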
Example Viewing Metadata
Example Editing Metadata
Example Editing Metadata
Authorised users can:
- Add descriptions
- Add new metadata schemes
Can have multiple schemas in parallel:
- Can keep original metadata from source systems
- Allows additional archival information to be kept
- Some potential for overlap
Schema Administration
So if adding a new source (or if source metadata changes):
- Upload schema
- Simply embed metadata in the correct structural entities
- Add view transform
- Add edit transform
- Add transform for schema conversion (if wanted)
- Configure SOLR for fielded search
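The administration steps above amount to a per-schema checklist. The sketch below models that checklist as a small Python record so that outstanding steps can be listed. It is purely illustrative: `SchemaRegistration`, its fields, and the `urn:example:new-source` identifier are hypothetical and do not correspond to any product API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Transform = Callable[[str], str]


@dataclass
class SchemaRegistration:
    """Hypothetical record of what has been configured for one schema."""
    uri: str
    schema_uploaded: bool = False
    view_transform: Optional[Transform] = None
    edit_transform: Optional[Transform] = None
    conversion_transform: Optional[Transform] = None  # optional step
    search_fields: List[str] = field(default_factory=list)

    def outstanding_steps(self) -> List[str]:
        """List required steps not yet done; schema conversion is optional."""
        steps = []
        if not self.schema_uploaded:
            steps.append("upload schema")
        if self.view_transform is None:
            steps.append("add view transform")
        if self.edit_transform is None:
            steps.append("add edit transform")
        if not self.search_fields:
            steps.append("configure fielded search")
        return steps


registration = SchemaRegistration(uri="urn:example:new-source")
registration.schema_uploaded = True
registration.view_transform = lambda xml_text: "<pre>" + xml_text + "</pre>"
print(registration.outstanding_steps())
# ['add edit transform', 'configure fielded search']
```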
Advantages
Easy to add a new source:
- Very low process overhead
No waiting required:
- Metadata can be consumed as is
- No backlog of things waiting to be archived!
- Can do appraisal, catalogue updates, etc. later as needed
No loss of existing metadata:
- Even if transformed, the original metadata is kept in the archival audit trail
- Resilient to change
Reduces barriers to starting:
- Don't have to get it just right up front
Disadvantages
Potential lack of consistency across descriptions
- Within a particular scheme
- Across differing schemes
Overlap between schemas:
- Could have multi-schema edit transforms
Fielded searches become harder
- (Have to pick the schema to pick the field)
These can be eliminated by re-cataloguing after ingest if required.
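One way to soften the fielded-search problem is a crosswalk that maps a single search field onto the equivalent element in each schema. The sketch below is a minimal in-memory illustration of that idea, not a description of how the search index is actually configured. The crosswalk, the schema labels (`dc`, `legislative`), and the records are all made up for this example.

```python
# Hypothetical crosswalk: one search field -> (schema, element) pairs.
CROSSWALK = {
    "author": [("dc", "creator"), ("legislative", "Author")],
    "summary": [("dc", "description"), ("legislative", "Summary")],
}

# Made-up records, each described in a different schema.
RECORDS = [
    {"id": "rec-1", "schema": "dc",
     "fields": {"creator": "Smith", "description": "Annual report"}},
    {"id": "rec-2", "schema": "legislative",
     "fields": {"Author": "Smith", "Summary": "A bill"}},
    {"id": "rec-3", "schema": "dc",
     "fields": {"creator": "Jones"}},
]


def fielded_search(records: list, field_name: str, value: str) -> list:
    """Return ids of records whose mapped element contains the value."""
    targets = set(CROSSWALK[field_name])
    needle = value.lower()
    hits = []
    for record in records:
        for element, text in record["fields"].items():
            if (record["schema"], element) in targets and needle in text.lower():
                hits.append(record["id"])
                break  # one hit per record is enough
    return hits


print(fielded_search(RECORDS, "author", "smith"))  # ['rec-1', 'rec-2']
```

The user searches on "author" once; the crosswalk, rather than the user, picks the schema-specific field.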
KDLA Case Study
Existing digital repository: DSpace
- Provides public access
- Limited preservation
Recent subscription to Preservica
- Fills a preservation gap
- Not intended for public access
What about metadata?
- Preservation
- Description
- Accessioning
KDLA Electronic Collections
Web sites, publications, minutes, geospatial datasets (maps), databases, digital images, video, and audio recordings.
File/folder vs item: Case Study of Photographs
Merging a file/folder-based description system (in-house file server and Preservica)
- Accession based
- Preserve the groupings and folder labels and attach accession metadata to every object
With an item-level system (DSpace or CONTENTdm)
DSpace Package
Preservica SIP + Accessioning Metadata
PREMIS (PREservation Metadata: Implementation Strategies)
Preservica's metadata schema contains PREMIS data elements, which provide information about the provenance of the AIP
PREMIS documents relationships between:
- Intellectual Entities
- Objects
- Events (normalization, audits, migration)
- Agents (based on logon)
Preservation events:
- Ingest
- Virus scan
- Sensitive data scan
- File format conversion
- Checksum calculation and integrity checks
- Normalization
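A preservation event ties together an object, an agent, a time, and an outcome. The sketch below records a checksum integrity check as such an event. It is a simplified illustration: the field names loosely follow PREMIS semantic units (event type, date/time, outcome, linking object, linking agent), but the `PreservationEvent` class, the `fixity_check` function, and all sample values are assumptions, and this is not PREMIS XML.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class PreservationEvent:
    """Simplified event record; names loosely follow PREMIS semantic units."""
    event_type: str
    event_date_time: str
    event_outcome: str
    linking_object: str
    linking_agent: str


def fixity_check(object_id: str, data: bytes, expected_sha1: str,
                 agent: str, when: datetime) -> PreservationEvent:
    """Recompute the SHA-1 of the data and record whether it still matches."""
    actual = hashlib.sha1(data).hexdigest()
    outcome = "pass" if actual == expected_sha1.lower() else "fail"
    return PreservationEvent("fixity check", when.isoformat(), outcome,
                             object_id, agent)


# Made-up content, agent, and timestamp for illustration.
content = b"example content"
stored_checksum = hashlib.sha1(content).hexdigest()
when = datetime(2013, 11, 13, 9, 0, tzinfo=timezone.utc)

audit_trail: List[PreservationEvent] = []
audit_trail.append(
    fixity_check("file1.jpg", content, stored_checksum, "archivist1", when))
# Simulate damage: one extra byte changes the checksum.
audit_trail.append(
    fixity_check("file1.jpg", content + b"!", stored_checksum, "archivist1", when))

print([event.event_outcome for event in audit_trail])  # ['pass', 'fail']
```

Appending events rather than overwriting them is what turns the records into an audit trail of provenance.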
Preservica SIP Creator Ingest
Accession Information Example
Conclusions
Need to use a consistent schema for:
- Structural metadata
- Technical metadata
Can support multiple descriptive / technical schemas:
- Still allow fielded view / edit / search
Hence can support:
- Multiple sources & ingest of native schemas
- Multiple versions of schemas
Consolidation is good but not vital:
- Could occur post-ingest
Lowers ingest barriers, provides flexibility, minimizes loss
Questions?
mark.evans@tessella.com
http://www.digital-preservation.com