What do you do when your file formats become obsolete?
Lydia T. Motyka
Florida Center for Library Automation
USETDA 2011
The FCLA, the FDA, and DAITSS
- FDA: a service of the Florida Center for Library Automation (FCLA) in Gainesville, Florida
- DAITSS (Dark Archive In The Sunshine State) is the repository software developed by FCLA for the Florida Digital Archive (FDA) as a preservation solution for the State University Libraries of Florida
- The FDA was the first fully OAIS-conformant (Open Archival Information System, ISO 14721:2003) repository in production in the United States (2005)
- The FDA is one of a handful of repositories in the United States to use format migration as a long-term preservation strategy
- The FDA repository is managed centrally at FCLA, with FDA Affiliates depositing materials to a central repository
- ETDs submitted to FCLA's ETD Service are archived automatically
- In April 2011 the FDA went into production with version 2 of its DAITSS software, and it is preparing an Open Source Software release for possible use by other repositories
FDA staff:
- 2 full-time developers, currently working on DAITSS 2 software enhancements
- 1 Formats Specialist/developer: works with all of the major file format-related tools and resources (DROID, JHOVE, UDFR)
- 1 Archive Manager: duties include troubleshooting, training and documentation for FDA Affiliates, special conversion projects, and developing new tools
- 1 Operations Technician: runs production
- Services of FCLA IT staff (one primary contact) for OS patches, backups, storage disk management, etc.
FDA: a centralized digital archive serving 10 State University Libraries in Florida
[Diagram: the FDA surrounded by UWF, FAMU, FAU, USF, UNF, FGCU, FIU, UF, UCF, and FSU]
FCLA's ETD Service
[Diagram: UWF, FAMU, FAU, USF, UNF, FGCU, FIU, UF, UCF, and FSU around the FDA, with the label "Via ETD Service"]
FCLA's ETD Service
- Catalog record creation
- Online storage and access
- Access control for restricted ETDs
- Optional UMI submission
- Automatic long-term preservation in the Florida Digital Archive
The ETD Service submission process
- The ETD is sent to FCLA by FTP
- A copy is processed for display (ETD Service)
- A copy is sent to the FDA workspace for archiving (FDA)
FDA direct Web Submission via DAITSS GUI
[Diagram: data model relating Information Package, Intellectual Entity, Data File, and Bitstream, with an m-to-1 cardinality marked]
Information Package Contents
- METS descriptor (manifest)
- content files (four shown on the slide)
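To make the manifest idea concrete, here is a minimal sketch of reading the file list out of a METS descriptor. It assumes the descriptor uses the standard METS `fileSec`/`file`/`FLocat` elements with `CHECKSUM` attributes; the function name, the sample document, and its placeholder checksum values are illustrative only and are not taken from DAITSS.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def list_content_files(mets_xml):
    """Return one record per <file> element found in the METS document."""
    root = ET.fromstring(mets_xml)
    records = []
    for file_el in root.iter(f"{{{METS_NS}}}file"):
        flocat = file_el.find(f"{{{METS_NS}}}FLocat")
        records.append({
            # The file location lives in the xlink:href attribute of FLocat.
            "path": flocat.get(f"{{{XLINK_NS}}}href") if flocat is not None else None,
            "checksum": file_el.get("CHECKSUM"),
            "checksum_type": file_el.get("CHECKSUMTYPE"),
        })
    return records


# Hypothetical two-file descriptor; the checksum values are placeholders.
SAMPLE_METS = """<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <fileSec>
    <fileGrp>
      <file ID="F1" CHECKSUM="placeholder-1" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:href="thesis.pdf"/>
      </file>
      <file ID="F2" CHECKSUM="placeholder-2" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:href="dataset.csv"/>
      </file>
    </fileGrp>
  </fileSec>
</mets>"""

for record in list_content_files(SAMPLE_METS):
    print(record["path"], record["checksum_type"])
```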
University of Florida's METS Editor
The DAITSS 2 archiving process
1. The ETD SIP (Submission Information Package) is checked for validity:
   - Completeness: are all described content files included?
   - Correctness: have all content files been correctly transmitted?
2. If the SIP is valid, its individual content files are processed:
   - Content file formats are identified and validated against format standards. Any file format inhibitors and anomalies are noted
   - File format transformation is performed according to an Action Plan for that format (e.g., PDF transformation to PDF/A-1b)
3. An AIP (Archival Information Package) is created for the ETD and two master copies are stored, one on disk on a remote server in Tallahassee and one on tape in Gainesville. The FDA's preservation database is populated with extensive metadata about each archived package and file.
4. A report is emailed or sent by FTP to the submitting institution, detailing any file format inhibitors and anomalies encountered during archiving
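The completeness and correctness checks in step 1 can be sketched as follows. This is an illustration under stated assumptions, not the DAITSS implementation: the manifest is assumed to be a plain mapping from relative path to expected SHA-256 digest, and `validate_sip` is a hypothetical name.

```python
import hashlib
import tempfile
from pathlib import Path


def validate_sip(sip_dir, manifest):
    """Check a package directory against its manifest.

    manifest: mapping of relative path -> expected SHA-256 hex digest
    (an assumed shape chosen for this sketch).
    Returns a list of problem descriptions; an empty list means valid.
    """
    problems = []
    for rel_path, expected in manifest.items():
        path = Path(sip_dir) / rel_path
        if not path.is_file():
            # Completeness: a described content file is not in the package.
            problems.append(f"missing: {rel_path}")
            continue
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
        if actual != expected:
            # Correctness: the file arrived, but its content differs.
            problems.append(f"checksum mismatch: {rel_path}")
    return problems


# Usage: one file is present and intact, one is described but absent.
with tempfile.TemporaryDirectory() as sip:
    (Path(sip) / "thesis.pdf").write_bytes(b"%PDF-1.4 example")
    digest = hashlib.sha256(b"%PDF-1.4 example").hexdigest()
    print(validate_sip(sip, {"thesis.pdf": digest, "appendix.pdf": digest}))
```

Only a package that returns no problems would move on to per-file processing in step 2.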
In addition to keeping multiple Archival Information Package masters, the FDA:
- Extracts and retains extensive technical metadata about each file and its component bitstreams:
  - File format information
  - Significant properties for file transformation
- Performs file format transformation to ensure long-term preservation
- Builds provenance history of submitted, normalized and migrated versions of each entity and its components
- Performs ongoing integrity and fixity checking:
  - Are both master copies still in the repository? (Integrity)
  - Has either copy of the package changed since it was stored? (Fixity)
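The two questions in the integrity and fixity check map onto a small audit loop. The sketch below assumes each master copy is a single file on a reachable path and that a SHA-256 digest was recorded at storage time; `audit_copies` and the site labels are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def audit_copies(copies, stored_digest):
    """Audit every master copy of one package.

    copies: mapping of site label -> path of that copy (assumed layout).
    stored_digest: SHA-256 hex digest recorded when the package was stored.
    """
    report = {}
    for site, path in copies.items():
        path = Path(path)
        if not path.is_file():
            report[site] = "missing"   # integrity failure: copy is gone
        elif hashlib.sha256(path.read_bytes()).hexdigest() != stored_digest:
            report[site] = "changed"   # fixity failure: content has drifted
        else:
            report[site] = "ok"
    return report


# Usage: the second copy is altered after storage and is flagged as changed.
with tempfile.TemporaryDirectory() as root:
    first = Path(root) / "copy-1.aip"
    second = Path(root) / "copy-2.aip"
    first.write_bytes(b"archived package")
    second.write_bytes(b"archived package")
    recorded = hashlib.sha256(b"archived package").hexdigest()
    second.write_bytes(b"archived packagE")
    print(audit_copies({"copy-1": first, "copy-2": second}, recorded))
```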
FDA repository structure
[Diagram labels: Submission Request, DAITSS workspace, Ingest, database, storage silo (Tallahassee), storage silo (Gainesville)]
DAITSS 2 architecture
Per file: Validation, Description, Action Plan, Transformation
- Identification: what format is it?
  - 1,200 recognized formats
  - Approx. 30 supported formats with Action Plans
- Validation and characterization:
  - Is the file well-formed for the identified format?
  - Is it valid for the identified format?
  - What are its significant properties?
- Is there an Action Plan for this format? If so, perform file transformation
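The identify-then-look-up flow for a single file might look like the sketch below. It is deliberately tiny: it recognizes only two formats by their leading bytes (the `%PDF-` header and the RIFF/WAVE container signature), whereas DAITSS relies on dedicated tools such as DROID and JHOVE. The `ACTION_PLANS` table and the function names are assumptions for illustration.

```python
# Hypothetical Action Plan lookup, keyed by identified format.
ACTION_PLANS = {
    "PDF": "normalize to PDF/A-1b",
    "WAVE": "normalize to PCM-encoded WAV",
}


def identify(data):
    """Identify a format from its leading bytes; None if unrecognized."""
    if data.startswith(b"%PDF-"):
        return "PDF"
    # A WAVE file is a RIFF container whose form type at offset 8 is "WAVE".
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "WAVE"
    return None


def plan_for(data):
    """Return (format, action) for one file's bytes.

    The action is None when the format is unidentified or when no
    Action Plan exists for it; in both cases nothing is transformed.
    """
    fmt = identify(data)
    if fmt is None:
        return ("unidentified", None)
    return (fmt, ACTION_PLANS.get(fmt))


print(plan_for(b"%PDF-1.4\n"))
print(plan_for(b"plain text"))
```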
Background Reports and Action Plans
- Background report:
  - Format Description
  - Pointer to Specification
  - How to Recognize
  - History and Duration
  - Maintenance Body
  - Platform Support
  - Legal Issues
  - Perceived Popularity
  - Limitations
  - Related Specifications
- Action Plan:
  - Normalization format(s)
  - Preservation plans: Original format, Normalized format
  - Revised on a regular schedule
DAITSS 2 file transformation services
- Normalization. If a file is in a format considered to be less than optimal for digital preservation, a version of the file may be created in a more preservation-worthy format. In general, preferred formats are non-proprietary, well documented, and well understood by FDA staff. (Example: WAV files are normalized to PCM-encoded WAV; all PDFs might be normalized to PDF/A-1b)
- Migration. If a file is in a format considered at risk of obsolescence, a version may be created in a format considered to be a reasonable successor to the original format. Every effort will be made to retain the appearance and behaviors of the original version, although this cannot always be guaranteed.
- Dissemination of archived packages always returns the last, best version of files as well as the original version of all files.
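The dissemination rule (original plus last, best version) can be modeled with a simple version history per file. This sketch assumes that the most recently created version is the "last, best" one, which the slides do not state explicitly; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FileVersion:
    name: str
    event: str  # "submitted", "normalized", or "migrated"


@dataclass
class ArchivedFile:
    versions: list = field(default_factory=list)  # oldest first

    def add(self, name, event):
        self.versions.append(FileVersion(name, event))

    def disseminate(self):
        """Return the original plus the newest derived version, if any."""
        original = self.versions[0]
        latest = self.versions[-1]
        if latest is original:
            return [original]
        return [original, latest]


# Usage: a PDF is normalized at ingest and migrated years later.
record = ArchivedFile()
record.add("thesis.pdf", "submitted")
record.add("thesis_norm.pdf", "normalized")
record.add("thesis_mig.pdf", "migrated")
print([v.name for v in record.disseminate()])
```

Intermediate versions stay in the provenance history but are not returned, which keeps a disseminated package small while preserving the original.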
Reports of file format anomalies are sent to FDA Affiliates by FTP or email immediately after archiving
Pre-submission use of FDA Description Service
Benefits of file format transformation include:
- Short-term: feedback on inhibitors and anomalies allows for file correction
- Long-term:
  - Files without inhibitors and anomalies are more suitable for long-term preservation
  - Delivery of latest, best copy of file formats
  - Ability to migrate to a successor format should the original format be at risk of obsolescence
- The DAITSS Formats Specialist is constantly working to add new Action Plans and update current ones
- The file format metadata in the FDA's preservation database allows quick identification of files in the repository by format for possible transformation ("refresh")
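Selecting refresh candidates by format is essentially a database query. The sketch below uses an in-memory SQLite database with an invented single-table schema (`files` with `package_id`, `path`, `format`); the real FDA preservation database schema is not described in these slides.

```python
import sqlite3


def demo_database():
    """Build a tiny in-memory stand-in for a preservation database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (package_id TEXT, path TEXT, format TEXT)")
    conn.executemany(
        "INSERT INTO files VALUES (?, ?, ?)",
        [
            ("E001", "thesis.pdf", "PDF 1.4"),
            ("E002", "audio.wav", "WAVE"),
            ("E003", "diss.pdf", "PDF 1.4"),
        ],
    )
    return conn


def files_in_format(conn, format_name):
    """List (package_id, path) for every archived file in one format."""
    cursor = conn.execute(
        "SELECT package_id, path FROM files "
        "WHERE format = ? ORDER BY package_id, path",
        (format_name,),
    )
    return cursor.fetchall()


# Usage: find every file that a revised PDF 1.4 Action Plan would affect.
for package_id, path in files_in_format(demo_database(), "PDF 1.4"):
    print(package_id, path)
```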
DAITSS 2 refresh process
DAITSS 2: Open Source Software
- DAITSS 2 in production at the FDA since April 22, 2011
- Works well in a consortial environment
- Currently preparing for OSS release via GitHub
- Summer 2011 intern project: testing a new installation that simulates an independent archive using DAITSS 2 software
- For more information about the possibility of using DAITSS 2 software to create your own digital archive, please contact Manny Rodriguez, avatar38@ufl.edu