Performance Study of Digital Object Format Identification & Validation Tools
Quyen Nguyen
ERA Systems Engineering, National Archives & Records Administration

Agenda
- Background
- Format Identification Tools
- Experiments
- Analysis
- Related Work
- Summary

OAIS Model for ERA
[Figure: OAIS functional model. Labels: Producer, SIP, Ingest, Descriptive Info, Data Management, AIP, Archival Storage, Access (queries, result sets, orders), DIP, Consumer, Preservation Planning, Administration, Management.]

Challenges and Requirements
[Figure: Ingest, Preserve, and Access, with the requirements Extensibility, Availability, Scalability, Security, and Evolvability. Labels: Format Identification, Ingest Verification.]
1. Complexity: records arrive in different formats, some of which may be obsolete.
2. Volume: enormous amounts of records.

Ingest Process Orchestration
[Figure: ingest flowchart. Boxes: Input File, Integrity Seal, Virus Scan, decision "Virus Found" (Yes: Process Virus Infected File; No: continue), Access Restriction Scan, Metadata Collection, File Identification & Validation, STOP.]

File Format
- Real issue: the file extension is an unreliable indicator of a digital object's format, because it depends on the end user or the application.
- Format identification: determine the format of a digital object, e.g. Microsoft Word 2003, Acrobat 8 PDF, etc.
- Format validation: once a format f has been identified for a digital object X, does X really conform to format f? For example, an XML document may or may not be well-formed.
[Figure: example formats grouped as ASCII (TXT, XML, VRML) and BINARY (MS Word, TIFF, PDF).]
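As a small illustration of the validation question, the sketch below checks whether a byte string is well-formed XML using Python's standard xml.etree.ElementTree parser. The helper name is_well_formed_xml is hypothetical and is not part of JHOVE or DROID; it tests well-formedness only, not validity against a schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(data: bytes) -> bool:
    """Return True if data parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


good = b"<record><title>Press release</title></record>"
bad = b"<record><title>Press release</record>"  # mismatched closing tag

print(is_well_formed_xml(good))
print(is_well_formed_xml(bad))
```

Both inputs could carry an .xml extension; only the content check tells them apart.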

Identification & Validation Tool
- Several institutions have developed such tools.
- A tool performs the following tasks:
  1. Read the input file.
  2. Find a matching signature.
  3. Output metadata: the file format (PDF, JPEG, Microsoft Word, etc.) and the version number of the application used to create the digital object.
- It sounds simple, yet it is difficult.
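The three tasks above can be sketched as follows. This is a minimal illustration, not the logic of any particular tool: the SIGNATURES table holds only a few well-known leading byte sequences, the function name identify_by_signature is hypothetical, and the version it reports is the format version embedded in the header (for PDF and GIF), not the version of the creating application.

```python
# A few well-known leading byte signatures ("magic numbers").
# Illustrative only; a real signature registry is far larger.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
]


def identify_by_signature(data: bytes) -> dict:
    """Match the leading bytes against known signatures and return metadata."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            meta = {"format": name}
            if name == "PDF":
                # The PDF header carries the format version, e.g. b"%PDF-1.4".
                meta["version"] = data[5:8].decode("ascii", errors="replace")
            elif name == "GIF":
                # Bytes 3..5 of a GIF header are "87a" or "89a".
                meta["version"] = data[3:6].decode("ascii")
            return meta
    return {"format": "unknown"}


print(identify_by_signature(b"%PDF-1.4\n%..."))
print(identify_by_signature(b"plain text with no signature"))
```

The function never looks at a file name, which is the point: the content, not the extension, decides the answer.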

JHOVE
- JSTOR/Harvard Object Validation Environment, developed by JSTOR (Journal STORage) and Harvard University Library.
- A set of modules called handlers, each of which is responsible for one file type.
- JHOVE traverses the set of handlers until it finds one that can positively identify the type of the input file.
- JHOVE can output rich metadata; technical metadata such as MIX data elements for image files are part of the output.
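The handler traversal described above can be sketched as a simple chain. This is an assumption-laden illustration of the idea, not JHOVE's code or API: the three handlers, their names, and their ordering are hypothetical, and the final catch-all handler mirrors the general ByteStream identification mentioned later in the Analysis slide.

```python
from typing import Callable, List, Optional, Tuple

# A handler returns a metadata dict if it recognises the input, else None.
Handler = Callable[[bytes], Optional[dict]]


def gif_handler(data: bytes) -> Optional[dict]:
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return {"format": "GIF", "version": data[3:6].decode("ascii")}
    return None


def ascii_handler(data: bytes) -> Optional[dict]:
    # Non-empty input in which every byte is 7-bit ASCII.
    if data and all(byte < 0x80 for byte in data):
        return {"format": "ASCII"}
    return None


def bytestream_handler(data: bytes) -> Optional[dict]:
    # Catch-all: any sequence of bytes is a byte stream.
    return {"format": "ByteStream"}


# Hypothetical ordering: specific handlers first, catch-all last.
HANDLERS: List[Tuple[str, Handler]] = [
    ("gif", gif_handler),
    ("ascii", ascii_handler),
    ("bytestream", bytestream_handler),
]


def identify_with_handlers(data: bytes) -> dict:
    """Try each handler in turn; stop at the first positive identification."""
    for name, handler in HANDLERS:
        result = handler(data)
        if result is not None:
            return {"handler": name, **result}
    return {"handler": None, "format": "unknown"}


print(identify_with_handlers(b"GIF89a\x01\x00"))
print(identify_with_handlers(b"#VRML V2.0 utf8\n"))
print(identify_with_handlers(b"\x00\xff\x10"))
```

In this sketch a VRML file is reported only as ASCII, because no handler knows the format; the cost of a chain also grows with the number of handlers that must fail before one succeeds.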

DROID
- Digital Record Object Identification, developed by the United Kingdom National Archives.
- Based on the PRONOM registry of file signatures specific to file types.
- At runtime, the content of the registry can be downloaded as an XML file and cached in the DROID process.
- DROID traverses the signature file containing the cached PRONOM content, trying to match its signatures one by one against the signature found in the input file.

[Figure: JHOVE screenshot]

[Figure: DROID screenshot]

Experimentation Environment
- Hardware: Intel CPU T2500 @ 2.0 GHz, 2.0 GB of RAM.
- Operating system: Microsoft Windows XP Professional Version 2002, Service Pack 2.
- Runtime: the JVM that comes with Sun JDK 1.6.0_01-b01, started with java -Xms1024m -Xmx1024m.
- Tools: JHOVE version 1.1 (2006-02-13) and DROID v1.1.
- Instrumentation: simple tracing code inserted, using System.currentTimeMillis(), Runtime.totalMemory(), and Runtime.freeMemory().
- Metrics:
  - Execution time (ms): time-jhove, time-droid.
  - Heap size (KB): heap-jhove, heap-droid.
- 50 measurements per collection or file.
- Statistical tools: Microsoft Excel and Stats4U.
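The measurement loop can be sketched as below. This is a Python analogue under stated assumptions, not the tracing code used in the study: time.perf_counter stands in for System.currentTimeMillis(), and tracemalloc's peak of Python-level allocations stands in for a heap figure derived from Runtime.totalMemory() and Runtime.freeMemory(), which is a different quantity. The function name measure and the placeholder task are hypothetical.

```python
import time
import tracemalloc
from typing import Callable, List, Tuple


def measure(task: Callable[[], object], repeats: int = 50) -> List[Tuple[float, float]]:
    """Run task `repeats` times; return (elapsed_ms, peak_kb) for each run."""
    samples = []
    for _ in range(repeats):
        tracemalloc.start()
        start = time.perf_counter()
        task()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        samples.append((elapsed_ms, peak_bytes / 1024.0))
    return samples


# Placeholder workload standing in for "identify one file".
samples = measure(lambda: sum(range(10_000)))
print(len(samples))
```

Collecting every run, rather than a single average, is what makes the later t-tests possible.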

Data Corpus
- Corpus C1: the examples shipped with JHOVE.
- 112 files ranging in size from 1 KB to 22 MB; most of the files are smaller than 100 KB.
- Files are grouped into subdirectories according to their document types: ASCII, GIF, HTML, JPEG, PDF, TIFF, WAV, and XML.
- The HTML subdirectory also contains the GIF and JPEG images used in the HTML pages.

Data Corpus (2)
- Corpus C2: 24 collections of documents used in the NARA research lab.
- Typical documents coming to the public archives:
  - Photos from the National Park Service
  - Documents related to Katrina
  - Case files of U.S. District Courts
  - White House press releases
  - Environmental maps from the EPA
- 1280 files ranging in size from 1 KB to 136 MB.
- Notably, the document types are more varied: in addition to the types found in C1, there are audio files, video clips, geospatial files, statistical files, etc.

Statistical Analysis
- Perform a t-test using Microsoft Excel.
- Execution time:
  - Null hypothesis H0: time-droid = time-jhove.
  - Alternative hypothesis H1: time-droid > time-jhove.
- Heap size:
  - Null hypothesis H0: heap-droid = heap-jhove.
  - Alternative hypothesis H1: heap-droid < heap-jhove.
- To retain H0 with confidence, we want the t statistic to be small in magnitude and P(T<=t) to be close to 1.
- Watch the sign of the t statistic.
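For readers without Excel at hand, the test statistic can be sketched with the Python standard library. Several assumptions apply: the slides do not say which t-test variant was used, so this sketch uses Welch's unequal-variance form; the one-tail probability uses a normal approximation, which is reasonable only for large samples; and the sample values are made up for illustration, not taken from the study.

```python
import math
from statistics import NormalDist, mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom) for mean(a) - mean(b)."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def one_tail_p_normal(t: float) -> float:
    """Approximate one-tail probability beyond |t| using the standard normal."""
    return NormalDist().cdf(-abs(t))


# Made-up timings in ms, for illustration only.
time_droid = [12.1, 11.8, 12.4, 12.0, 12.3]
time_jhove = [10.2, 10.5, 9.9, 10.1, 10.4]

t, df = welch_t(time_droid, time_jhove)
print(t > 0)  # a positive t means mean(time_droid) > mean(time_jhove)
print(one_tail_p_normal(t) < 0.01)
```

The sign convention matters: with the first argument as DROID, a large positive t supports H1 for execution time, while a large negative t supports H1 for heap size.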

Experiment 1
- Corpus: C1.
- Execution time: by the t-test at the 99% confidence level, time-droid is significantly greater than time-jhove (t Stat = 1487.29; P(T<=t) one-tail = 9.42E-182).
- Heap size: by the t-test at the 99% confidence level, heap-droid is significantly less than heap-jhove (t Stat = -34.50825771; P(T<=t) one-tail = 2.4239E-36).

Experiment 2
- Corpus: C2.
- Execution time: by the t-test at the 95% confidence level, time-jhove is significantly greater than time-droid (t Stat = -5.48; P(T<=t) one-tail = 2.51E-08).
- Heap size: DROID used less heap memory than JHOVE.

Data Type Impact
- Two collections, around the 865th and 1000th data points, caused a dramatic increase in time-jhove.
- They contain mostly VRML (Virtual Reality Modeling Language) files, which are essentially ASCII text but can be interpreted for display.

Experiment 3
- Result: a tie.
- Corpus: C2b = C2 minus the 2 VRML collections.
- Execution time: by the t-test at the 95% confidence level, there is no significant difference between time-droid and time-jhove (t Stat = 0.057; P(T<=t) two-tail = 0.95).
- Heap size: no difference.

Experiment 4
- Corpus: C2, re-arranged by types and sizes.
- Stat4U 3-way ANOVA with factors A = Tool, B = Type, C = Size (2 levels only).
- All 3 factors and their interactions are significant at the 95% confidence level.
- The Tool factor explains only 0.9% of the variation.

Linear Regression: Size-Time
[Figure: "DROID - Time vs. File Size". x-axis: size (KB); y-axis: time (ms); series droid-time with linear fit y = 0.1325x + 9.524, R^2 = 0.9965.]
- Corpus: C2.
- A linear regression was found only for time-droid: time-droid = 0.13 * size + 9.52 (time in ms, size in KB).
- Extrapolation: 100 TB ~ 5 months.
- Information for sizing: computing resources and parallelism.
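The 100 TB estimate can be reproduced from the fitted line. The sketch below makes three assumptions that are not stated on the slide: decimal units (1 TB = 10^9 KB), only the size-dependent term is extrapolated (the 9.524 ms intercept is a per-file cost and would depend on the number of files), and parallel workers give an ideal linear speed-up. The function names are hypothetical.

```python
SLOPE_MS_PER_KB = 0.1325  # slope of the fitted line
INTERCEPT_MS = 9.524      # intercept of the fitted line (per-file cost)


def droid_time_ms(size_kb: float) -> float:
    """Predicted DROID time for one file of the given size."""
    return SLOPE_MS_PER_KB * size_kb + INTERCEPT_MS


def days_for_volume(total_kb: float, workers: int = 1) -> float:
    """Days to process a volume, size-dependent term only, ideal speed-up."""
    total_ms = SLOPE_MS_PER_KB * total_kb / workers
    return total_ms / 1000.0 / 86_400.0


HUNDRED_TB_IN_KB = 100 * 10**9  # assumes 1 TB = 10**9 KB

print(round(days_for_volume(HUNDRED_TB_IN_KB)))  # 153 days, roughly 5 months
print(round(days_for_volume(HUNDRED_TB_IN_KB, workers=8), 1))
```

Under these assumptions a single process needs about 153 days, which matches the "5 months" on the slide and shows why parallelism enters the sizing discussion.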

Analysis
- Statistically, JHOVE and DROID perform equally well on the format types that JHOVE can identify.
- Qualitatively, the metadata generated by JHOVE is richer.
- For types that JHOVE cannot validate, its performance degrades drastically compared to DROID.
- Easy case: if JHOVE finds that a record is binary, it simply responds with a general identification, e.g. ByteStream.
- Hard case: some ASCII formats, such as VRML, may throw it off.

Integrated Approach
Two-phase approach to file identification and validation:
1. Pass the file through DROID to quickly identify its type.
2. If the type is on the list of types known to JHOVE, pass the file through JHOVE to extract technical metadata.
The extracted technical metadata are useful for automatic verification. Examples include image resolution, format version numbers, creation dates, font information, etc.
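The two-phase approach can be sketched as a small dispatcher. Everything named here is an assumption for illustration: two_phase, fake_identify, and fake_extract are hypothetical stand-ins, not DROID or JHOVE calls, and the known-format set simply reuses the document types listed for corpus C1.

```python
from typing import Callable, Dict, Optional, Set

# Assumed list of types the second-phase tool can handle (taken from corpus C1).
KNOWN_TO_VALIDATOR: Set[str] = {"ASCII", "GIF", "HTML", "JPEG", "PDF", "TIFF", "WAV", "XML"}


def two_phase(
    data: bytes,
    identify: Callable[[bytes], str],
    extract: Callable[[bytes, str], dict],
    known: Set[str] = KNOWN_TO_VALIDATOR,
) -> Dict[str, Optional[object]]:
    """Phase 1: quick identification. Phase 2: metadata extraction, known types only."""
    fmt = identify(data)
    record: Dict[str, Optional[object]] = {"format": fmt, "technical_metadata": None}
    if fmt in known:
        record["technical_metadata"] = extract(data, fmt)
    return record


# Stand-ins for the two tools, for illustration only.
def fake_identify(data: bytes) -> str:
    return "GIF" if data.startswith(b"GIF8") else "VRML"


def fake_extract(data: bytes, fmt: str) -> dict:
    return {"format_version": data[3:6].decode("ascii")}


print(two_phase(b"GIF89a\x01\x00", fake_identify, fake_extract))
print(two_phase(b"#VRML V2.0 utf8\n", fake_identify, fake_extract))
```

The second phase is skipped for types outside the known list, which is how the approach avoids the slow VRML case seen in Experiment 2.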

Related Work
- GDFR: Global Digital Format Registry
  - Distributed and replicated registry of format information.
  - Allows the registration and discovery of digital formats for the long term.
  - Collaboration of Harvard University Library and OCLC.
  - http://www.gdfr.info
- FOCUS: Format Curation Service at the University of Maryland
  - Main component is the Format Identifier (Fider).
  - Registry: Global Digital Format Registry (GDFR), implemented using LDAP.
  - https://wiki.umiacs.umd.edu/adapt/index.php/focus:main

Related Work (2)
- PERPOS: Presidential Electronic Records Pilot System at the Georgia Institute of Technology
  - Software tools to support the OAIS functionalities.
  - http://perpos.gtri.gatech.edu
- Metadata Extract Tool from the National Library of New Zealand
  - http://www.natlib.govt.nz/services/get-advice/digital-libraries/metadata-extraction-tool
- AIHT: Automated Preservation Assessment of Heterogeneous Digital Collections, by Stanford University
- The University of London Computer Centre issued a report comparing DROID, JHOVE, and AIHT: Assessment of File Format Testing Tool.
  - http://www.ulcc.ac.uk/uploads/media/daat_file_format_tools_report.pdf
  - Very good qualitative and functional analysis.
- Other projects?

Summary
- File identification and validation is an important step in the ingest process.
- Performance study of JHOVE and DROID.
- Optimal approach leveraging both tools.
- Monitor the future progress of other tools.
- Looking forward to JHOVE 2.

Thank You
http://www.archives.gov/era
For any comments or questions, please email quyen.nguyen@nara.gov