Next-Generation Characterization An Update on the JHOVE2 Project

Similar documents
Open Preservation Foundation and The Preservation Action Registry. Martin Wrigley, Executive Director, OPF

Contents: *** Draft ***

JHOVE2: Next-Generation Architecture for Format-Aware Characterization JHOVE2 Programmer's Guide

Digital Formats and Preservation

Global Digital Format Registry (GDFR) An Interim Status Report

JHOVE2: Next-Generation Architecture for Format-Aware Characterization JHOVE2 User s Guide

verapdf after PREFORMA

SIP AIP AIP DIP. Preservation Planning. Data Management. Ingest. Access. Archival Storage. Administration MANAGEMENT P R O D U O N S U M E R E R 4-1.

XML: Examining the Criteria to be an Open Standard File Format

Session Two: OAIS Model & Digital Curation Lifecycle Model

Robin Dale RLG

Digging into File Formats: Poking around at data using file, DROID, JHOVE, and more

k o L i b R I kopal Library for Retrieval and Ingest Documentation

A Micro-Services-Based Approach for Curation and Preservation Solutions

Digital Preservation DMFUG 2017

The OAIS Reference Model: current implementations

Long-term digital preservation of UNSWorks

The Development of Digital Preservation Best Practices in EPrints. OR2012 : The 7 th International Conference on Open Repositories

Building for the Future

Collaboration in Digital Preservation. Barbara Sierman Digital Preservation Manager, KB NL CSC Espoo,

Importance of cultural heritage:

Dr. MIQUEL MONTANER CTO at Easy Innova. Dr. VÍCTOR MUÑOZ R&D Manager at Easy Innova. XAVI TARRÉS Project Manager at Easy Innova

Digital Preservation at NARA

Assessing the Utility of Current Format Registry Efforts for Geospatial Formats

Systems Interoperability and Collaborative Development for Web Archiving

Preservation Regression Tests - Test & Execution - Internal MP3 to WAV

Co-operative Development of a Long-term Digital Information Archive

Preservation Planning in the OAIS Model

Andrea Goethals, Harvard Library ASERL Webinar File Information Tool Set

Susan Thomas, Project Manager. An overview of the project. Wellcome Library, 10 October

Assessment of product against OAIS compliance requirements

Digital Preservation: From Theory to Practice

Ensuring Proper Storage for Earth Science Data: The USGS Process to Certify Trusted Digital Repositories

The Making of PDF/A. 1st Intl. PDF/A Conference, Amsterdam Stephen P. Levenson. United States Federal Judiciary Washington DC USA

Characterisation. Digital Preservation Planning: Principles, Examples and the Future with Planets. July 29 th, 2008

DIGITAL ARCHIVES & PRESERVATION SYSTEMS

What is Islandora? Islandora is an open source digital repository that preserves, manages, and showcases your institution s unique material.

Its All About The Metadata

XML information Packaging Standards for Archives

Institutional Repository using DSpace. Yatrik Patel Scientist D (CS)

verapdf Industry supported PDF/A validation

Assessment of product against OAIS compliance requirements

This document is a preview generated by EVS

Developing a Framework for File Format Migrations. ipres 2015 Chapel Hill, NC 3 November 2015 Joey Heinen and Andrea Goethals

An Introduction to PREMIS. Jenn Riley Metadata Librarian IU Digital Library Program

Interactive Knowledge Stack A Software Architecture for Semantic Content Management Systems

verapdf: definitive, open source PDF/A validation for digital preservationists

An overview of the OAIS and Representation Information

UWDCC Case Study. Supplementing DPOE Workshop 6 June Steven Dast Digital Asset Librarian UW Digital Collections Center

Writing a Data Management Plan A guide for the perplexed

Any comments, corrections, or recommendations may be sent to the project team, care of:

What do you do when your file formats become obsolete? Lydia T. Motyka Florida Center for Library Automation USETDA 2011

RUtgers COmmunity REpository (RUcore)

PDF/arkivering PDF/A. Per Haslev Adobe Systems Danmark Adobe Systems Incorporated. All Rights Reserved.

Preservation Metadata Extraction and Collection : Tools and Techniques. Mat Black National Library of New Zealand Te Puna Matauranga o Aotearoa

Fedora, Islandora, & Samvera: Requirements and Gaps. David Wilcox,

Metadata and Encoding Standards for Digital Initiatives: An Introduction

Protection of the National Cultural Heritage in Austria

<goals> 10/15/11% From production to preservation to access to use: OAIS, TDR, and the FDLP

Number 40 September Editor: Lynn Campbell. CONTENTS Stephen Clarke Archives NZ

Current Digital Preservation Trends and SDB4 Features. Pauline Sinclair, Tessella, PASIG, Madrid, 5 July 2010

ISO INTERNATIONAL STANDARD. Document management Engineering document format using PDF Part 1: Use of PDF 1.6 (PDF/E-1)

Introduction to. Digital Curation Workshop. March 14, 2013 SFU Wosk Centre for Dialogue Vancouver, BC

Repository models and policies for preservation

Digital Preservation Research Initiatives at NLNZ. SAA Research Forum Austin, Texas 11 August 2009

Digital Preservation and The Digital Repository Infrastructure

Challenges and Successes: Running the Rosetta Format Library

The e-depot in practice. Barbara Sierman Digital Preservation Officer Madrid,

The RMap Project: Linking the Products of Research and Scholarly Communication Tim DiLauro

The Definitive PDF/A Conformance Checker

Greenstone Publications

XML Paper Specification (XPS)

Digital Preservation Workshop

Digital Preservation Workshop

Harvesting Metadata Using OAI-PMH

Emulation: the debate, registries and PREMIS

The Future of PDF/A and Validation

Archiving and Migrating Digital Files

The Archive Ingest and Handling Test: The Johns Hopkins Universi...

Digitisation Standards

Emulation for digital preservation in practice: the results

PREMIS in Archivematica

The Storage Networking Industry Association (SNIA) Data Preservation and Metadata Projects. Bob Rogers, Application Matrix

Digital preservation activities at the German National Library nestor / kopal

PRESERVING DIGITAL OBJECTS

Sustaining and Enhancing the Value of Digital Assets

Below is an example workflow of file inventorying at the American Geographical Society Library at UWM Libraries.

The Promise of PREMIS: background, scope and purpose of the Data Dictionary for Preservation Metadata

The Choice For A Long Term Digital Preservation System or why the IISH favored Archivematica

Web-based workflow software to support book digitization and dissemination. The Mounting Books project

Mitigating Preservation Threats: Standards and Practices in the National Digital Newspaper Program

CDL s Web Archiving System

Document Title Ingest Guide for University Electronic Records

Archival Information Package (AIP) E-ARK AIP version 1.0

Enabling Collaboration for Digital Preservation

DRI: Dr Aileen O Carroll Policy Manager Digital Repository of Ireland Royal Irish Academy

Create your own Carbon Component. Sameera Jayasoma Technical Lead and Product Manager of WSO2 Carbon

PASIG Directions & Issues

This work is licensed under the Creative Commons Attribution 4.0 International License. Page 1 of 10

Draft Digital Preservation Policy for IGNCA. Dr. Aditya Tripathi Banaras Hindu University Varanasi

Transcription:

NDIIPP Partners Meeting Arlington, Virginia, July 20-22, 2010 Next-Generation Characterization An Update on the JHOVE2 Project JHOVE2 Project Team California Digital Library, Portico, Stanford University

The preservation problem Managing the gap between what you were given and what you need That gap is only manageable if it is quantifiable Characterization tells you what you have, as a stable starting point for iterative preservation planning and action Characterization Preservation action Preservation planning Adopted from A. Brown, Developing Practical Approaches to Active Preservation, IJDC 2:1 (June 2007): 3-11.

Tell me about yourself United Features Syndicate, Inc.

What? So what? Characterization is the automated determination of the intrinsic and extrinsic properties of a formatted object Identification Feature extraction Validation Assessment What is it? What about it? What is it, really? So what?

Validation vs. assessment Validation is the determination of the level of conformance to the normative requirements of a format s authoritative specification To the extent that there is community consensus on these requirements, validation is an objective determination Assessment is the determination of the level of acceptability for a specific purpose on the basis of locally-defined policy rules Since these rules are locally configurable, assessment is a subjective determination

We report, you decide Fox News Network LLC

Characterization in ingest workflows Producer Archive Content Content Identification Feature extract Validation Assessment Package SIP Unpackage Identification Feature extract Validation Consistency Assessment Ingest Policy rules Metadata Policy rules Metadata Metadata

Characterization in migration workflows Content Content AIP Unpackage Assessment Migration Identification Feature extract Validation Equivalence (Re)Ingest Policy rules Metadata Metadata

JHOVE2 project Build on the success of JHOVE, addressing some of its known deficiencies of design and implementation, and extending its function Collaboration of CDL, Portico, and Stanford Funded by NDIIPP Open source deliverables (BSD)

Feature set Multi-stage processing Signature-based identification DROID http://droid.sourceforge.net/ Feature extraction Validation Message digesting Adler-32, CRC-32, MD2, MD5, SHA-1, SHA-256, SHA-384, SHA-512 Rules-based assessment Processing of objects spanning files and objects that are subsets of files Recursive processing of objects arbitrarily-nested within containers

Feature set Granular modularization with generic plug-ins Clean APIs and common module design patterns Buffered I/O Internationalized output Extensive configuration via dependency injection Complete documentation User s guide Architectural overview Module specifications Programmer s guide

Supported formats JHOVE2 can identify (by DROID) many more formats than it can validate (by modules) PRONOM registry documents over 550 formats http://www.nationalarchives.gov.uk/pronom

Supported formats ICC color profile (ICC.1:2004-10) JPEG 2000 JP2 (ISO/IEC 15444-1), JPX (ISO/IEC 15444-2) PDF SGML Shapefile PDF 1.0 1.7, ISO 3200-1, PDF/A-1 (ISO 19005-1), PDF/X-1 (ISO 15920-1), -1a (ISO 15930-4), -2 (ISO 15930-5) -3 (ISO 15930-6) Main, Index, dbase, TIFF TIFF 4 6, Class B, F, G, P, R, Y, TIFF/EP (ISO 12234-2), UTF-8 WAVE XML Zip TIFF/IT (ISO 12639), GeoTIFF, Exif (JEITA CP-3451), DNG ASCII (ANSI X3.4) BWF (EBU N22-1997)

Supported formats netcdf http://www.unidata.ucar.edu/software/netcdf Grib http://www.wmo.int/pages/prog/www/wdm/guides/guide-binary-2.html Developed by the Wegener Institute (Germany) http://www.awi-potsdam.de Widely used for meteorological data

(Un)supported formats AIFF GIF HTML JPEG HTML can be expressed in terms of SGML or XML We re investigating funding options for subsequent development of GIF and JPEG modules

Java 1.6 J2SE http://java.sun.com/javase/6/docs/api Implementation Annotations http://java.sun.com/javase/6/docs/technotes/guides/language/annotations.html Buffered I/O (java.nio) http://java.sun.com/javase/6/docs/api/java/nio/package-summary.html Reflection http://java.sun.com/docs/books/tutorial/reflect Spring dependency injection framework http://www.springframework.org/ Mercurial distributed code repository http://mercurial.selenic.com/ Maven build management http://maven.apache.org/ Bitbucket code hosting http://www.bitbucket.org/

Properties and reportables A property is a named, typed value Name Unique formal identifier Data type Scalar or collection Java types, JHOVE2 primitive types, or JHOVE2 reportables Typed value Description of correct semantic interpretation A reportable is a named set of properties Reportables correspond to Java classes Properties correspond to fields

Source units A formatted object about which characterization information can be meaningfully reported Unitary File File inside of a container Byte stream inside a file Aggregate Directory Directory inside of a container File set Clump e.g. TIFF e.g. TIFF inside a Zip e.g. ICC inside a TIFF e.g. command line arguments e.g. Shapefile For purposes of characterization, directories, file sets, and clumps are considered formats

Characterization strategy 1. Identify format 2. Dispatch to appropriate format module a) Extract format features and validate If a nested source unit is found, process recursively, (go to Step 1) b) Validate format profiles (optional) 3. If unitary source unit, calculate message digests 4. Assess 5. If aggregate source unit, try to identify aggregate format, and if successful, process recursively (go to Step 1)

Characterization strategy directory/ abc.shp abc.shx abc.dbf abc.tif xyz.pdf

Characterization strategy directory/ abc.shp abc.shx abc.dbf abc.tif xyz.pdf Main Index dbase GeoTIFF PDF

Characterization strategy directory/ Shapefile clump abc.tif xyz.pdf GeoTIFF PDF abc.shp abc.shx abc.dbf Main Index dbase

Characterization strategy directory/ GIS object clump xyz.pdf PDF Shapefile clump abc.tif GeoTIFF abc.shp abc.shx abc.dbf Main Index dbase

Assessment Evaluation of prior characterization information relative to local policy Assessment results can inform preservation decision making Determine level of risk Assign level of service Take action now or later

Assessment Assessment rules are logical expressions of the form If condition then consequent else alternative A condition is defined by either a universal or existential qualifier for all there exists or for any and an arbitrary set of predicates (logical assertions) of the form property relation value Supported relational operators ==!= < > =< => contains

Assessment XML rule example (pseudocode) If ALL_OF xmldeclaration.standalone == 'yes valid.tostring() == 'true' Then Acceptable Else Not acceptable End If Predicates are evaluated using MVEL http://mvel.codehaus.org/

Demonstration % jhove2 [-ik] [-b size] [-B Direct NonDirect Mapped] [-d JSON Text XML] [ f limit] [ t temp] [-o file] file... -i Show identifiers in JSON and Text displayers -k Calculate message digests -b size I/O buffer size, in bytes (default: 131072) -B type I/O buffer type: Direct, NonDirect, Mapped (default: Direct) -d displayer Displayer: JSON, Text, XML (default: Text) -f limit Fail fast limit (default: 0, no limit) -t temp Temporary directory -o file Output file (default: standard output) file File or directory

User survey 145 respondents, 88 institutions, 23 countries Full results available at https://confluence.ucop.edu/display/jhove2info/user+survey

User survey Full results available at https://confluence.ucop.edu/display/jhove2info/user+survey

Sustainability Final production release in September 2010 Workshop at ipres 2010, Vienna, September 19-24 http://www.ifs.tuwien.ac.at/dp/ipres2010 Project partners will provide ongoing, self-funded maintenance (but not development) Funded development activities Integration with DuraCloud (DuraSpace) ARC and WARC modules (Bibliothèque nationale de France)

Sustainability Possible development efforts Additional format modules Configuration GUIs JHOVE2-as-a-service Integration with DAITTS, DSpace, Fedora, FITS, etc. Training and tutorials Train the trainer Look for a permanent institutional home

Questions? http://jhove2.org JHOVE2-Announce-L@listserv.ucop.edu JHOVE2-Techtalk-L@listserv.ucop.edu CDL Stephen Abrams Patricia Cruse John Kunze Isaac Rabinovitch Marisa Strong Perry Willett Stanford University Richard Anderson Tom Cramer Hannah Frost Portico John Meyer Sheila Morrissey Library of Congress Martha Anderson Justin Littman With help from Walter Henry Nancy Hoebelheinrich Keith Johnson Evan Owens Advisory Board Deutsche Nationalbibliothek Dspace / MIT Ex Libris Fedora Commons / Rutgers Florida Center for Library Automation Harvard University Koninklijke Bibliotheek National Archives (UK) National Archives (US) National Library of Australia National Library of New Zealand Planets / Universität zu Köln Tessella