Preserving PDF at the coalface

Similar documents
File obsolescence at the ADS?

Workshop Background. Purpose. Context. To provide you with resources and tools to help you know how to handle file format decisions as a researcher.

Scanshare Sales Guide V1.2

Review PDF Converter for Windows software downloader cnet ]

PDF/A - The Basics. From the Understanding PDF White Papers PDF Tools AG

Preparing PDF Files for ALSTAR

PROCESSING AND CATALOGUING DATA AND DOCUMENTATION: QUALITATIVE

Creating PDF/A using Office Apps which Include 1- button PDF Creators

DOWNLOAD OR READ : CONVERTING WORD DOCUMENT TO FORM PDF EBOOK EPUB MOBI

PROCESSING AND CATALOGUING DATA AND DOCUMENTATION - QUALITATIVE

DOCUMENT NAVIGATOR SALES GUIDE ADD NAME. KONICA MINOLTA Document Navigator Sales Guide

Features & Functionalities

The Future of PDF/A and Validation

PDF/A Competence Center Webinars

CONVERT EXCEL DOCUMENT INTO

Spreadsheet Procedures

DOWNLOAD OR READ : WORD AND IMAGE IN ARTHURIAN LITERATURE PDF EBOOK EPUB MOBI

DOWNLOAD OR READ : FREE SERVICE MANUAL 2006 GMC SIERRA PDF EBOOK EPUB MOBI

DOWNLOAD OR READ : WHEN YOU ARE CONVERTED PDF EBOOK EPUB MOBI

ABBYY FineReader 14. User s Guide ABBYY Production LLC. All rights reserved.

DOWNLOAD OR READ : WORD 6 FOR WINDOWS ESSENTIALS PDF EBOOK EPUB MOBI

Graham Taylor.

DOWNLOAD OR READ : WORD 10 FOR MAC OS X VISUAL QUICKSTART GUIDES PDF EBOOK EPUB MOBI

Characterisation. Digital Preservation Planning: Principles, Examples and the Future with Planets. July 29 th, 2008

Features & Functionalities

ABBYY FineReader 14 YOUR DOCUMENTS IN ACTION

ELIMINATING ZERO-DAY MALWARE ATTACKS IN DOCUMENTS DO NOT ASSUME OPENING A NORMAL BUSINESS DOCUMENT IS RISK FREE

verapdf after PREFORMA

Cloned page. A Technical Introduction to PDF/UA. DEFWhitepaper. The PDF/UA Standard for Universal Accessibility

Sustainable File Formats for Electronic Records A Guide for Government Agencies

Create PDF s. Create PDF s 1 Technology Training Center Colorado State University

DOWNLOAD OR READ : TO BE FREE PDF EBOOK EPUB MOBI

XF Rendering Server 2008

MAKING FILES ACCESIBLE WITH ADOBE ACROBAT PRO. available to create PDFs, not just read them, and is the focus of this Quick Start Guide.

How. Can Acrobat Help My Bar Association? Catherine Sanders Reach ABA Legal Technology Resource Center

DOWNLOAD OR READ : WHATS THE DIFFERENCE IN PROTESTANT AND ROMAN CATHOLIC BELIEFS PDF EBOOK EPUB MOBI

Records management workflows

The Development of Digital Preservation Best Practices in EPrints. OR2012 : The 7 th International Conference on Open Repositories

Using Save As to Conform to PDF/A

Susan Thomas, Project Manager. An overview of the project. Wellcome Library, 10 October

ICH M8 Expert Working Group. Specification for Submission Formats for ectd v1.1

DOWNLOAD OR READ : THE WORD AND THE SPIRIT PDF EBOOK EPUB MOBI

Introducing PDF/UA. The new International Standard for Accessible PDF Technology. Solving PDF Accessibility Problems

CutePDF - Convert to PDF for free, Free PDF Utilities with a free trial version software for 3 weeks low risk middle return is invested by.

DOWNLOAD OR READ : THE WORD ON THE STREET IS HEARD BY ALL BUT ONLY A FEW LISTEN PDF EBOOK EPUB MOBI

Long-term digital preservation of UNSWorks

DOWNLOAD OR READ : WORD FROM THE TOP PDF EBOOK EPUB MOBI

Session Two: OAIS Model & Digital Curation Lifecycle Model

GUIDELINES FOR CREATION AND PRESERVATION OF DIGITAL FILES

ISO INTERNATIONAL STANDARD. Document management Engineering document format using PDF Part 1: Use of PDF 1.6 (PDF/E-1)

compart PDF/A-Support in Compart Products PDF/A White Paper White Paper May 2006

Swedish National Data Service, SND Checklist Data Management Plan Checklist for Data Management Plan

MEDIA RELATED FILE TYPES

The OAIS Reference Model: current implementations

Learn Html Pdf Converter To Excel Software Full Version

ABBYY FineReader 14 Full Feature List

Supported File Types

DRS Update. HL Digital Preservation Services & Library Technology Services Created 2/2017, Updated 4/2017

PRIMAVERA CONTRACT MANAGEMENT (PCM) CLOSEOUT PROGRAMMING SOLUTION

Overview. Finding information and help Adobe Acrobat. Where to find it and why to use it. When converting from Word to Acrobat

What do you do when your file formats become obsolete? Lydia T. Motyka Florida Center for Library Automation USETDA 2011

Implementing a Standardized PDF/A Document Storage System with LEADTOOLS

DOWNLOAD OR READ : THE WORD IN THE WORLD PDF EBOOK EPUB MOBI

Draft Digital Preservation Policy for IGNCA. Dr. Aditya Tripathi Banaras Hindu University Varanasi

DOWNLOAD OR READ : THE WORD OF A LIAR PDF EBOOK EPUB MOBI

Adobe Acrobat 8 Professional - Available November 8, 2006 Communicate and Collaborate with the Essential PDF Solution

DOWNLOAD OR READ : THE WORD OF A PRINCE LIFE OF ELIZABETH I FROM CONTEMPORARY DOCUMENTS PDF EBOOK EPUB MOBI

Application form guidance

PDF/arkivering PDF/A. Per Haslev Adobe Systems Danmark Adobe Systems Incorporated. All Rights Reserved.

I.R.I.S. is shipping IRISPdf Server 6.0, including ihqc, I.R.I.S. new & revolutionary high quality compression technology!

ISO PDF/A -Standard Archive file format standard for long-term preservation

DOWNLOAD OR READ : THE WORD ON THE YARD THE PONY WHISPERER 1 PDF EBOOK EPUB MOBI

DOWNLOAD OR READ : THE LATEST WORD OF UNIVERSALISM PDF EBOOK EPUB MOBI

Portal Subcommittee Agenda Thursday, May 10, :45-11:45. Criminal Case Initiation Workgroup Update Judge Martin Bidwill

PEERNET File Conversion Center

4. TECHNOLOGICAL DECISIONS

PDF/A in the Product Lifecycle

Data Structures in Archaeology - Information for the Future

DOCUMENTATION INGEST PROCESSING PROCEDURES

DOWNLOAD OR READ : USING WORDPERFECT 8 FOR LINUX SPECIAL EDITION SPECIAL EDITION USING PDF EBOOK EPUB MOBI

Managing Records in Electronic Formats. An Introduction

Sparta Systems TrackWise Solution

Preparating data files for printing

DOWNLOAD OR READ : THE IMAGE OF THE POPULAR FRONT PDF EBOOK EPUB MOBI

OnDemand Discovery Quickstart Guide

ADOBE 9A Adobe LiveCycle ES Application Developer.

Archives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment

Using Singleview with Office 2007/2010

Learn Html Pdf Convert To Word Full Version For Windows 7

SVG and the Preservation of Vector Images

Go From Paper to Automated PDFs and Purpose-built Models

The Choice For A Long Term Digital Preservation System or why the IISH favored Archivematica

Working with PDF and PDF/X Technology. This article is supported by...

PDF/A for Scanned Documents

An overview of the OAIS and Representation Information

ScholarOne Manuscripts. Author File Upload Guide

Document Version: 1.0. Purpose: This document provides an overview of IBM Clinical Development v released by the IBM Corporation.

The Making of PDF/A. 1st Intl. PDF/A Conference, Amsterdam Stephen P. Levenson. United States Federal Judiciary Washington DC USA

Agenda. Bibliography

Taming the Wild PDF within the In-Plant Printer

Transcription:

Preserving PDF at the coalface PDF/A at the Archaeology Data Service Tim Evans 15-07-2015

Introduction The Archaeology Data Service: Established in 1996 Based within the Department of Archaeology, University of York Digital archive for UK-based fieldwork and research in archaeology Academic research archives produced by Higher Education The results of development-led fieldwork

Is destructive! What was the impetus in 1996? Resources are unique The preserved record becomes the main resource for future interpretation

Progress to-date Collections o1,100,000 metadata records o700+ rich archives o30,000+ reports

The problem Traditionally archives deposited as part of a formal process: Negotiation and enforcement of accepted formats Documents normally deposited as DOC, DOCX, RTF, ODT etc

The problem Documents often comprise text and raster/vector elements created in a range of Softwares e.g. Adobe Illustrator ADS recommended different elements deposited as separate files for preservation Text as DOC etc Raster at TIF etc Vector as DXF A PDF (usually 1.4) created for dissemination

The problem On occasion a PDF was the only format that could be deposited Lack of technical expertise Departure of staff member Apathy!

Preservation Headache! Copying and pasting into Word! Saving in alternative formats XML Last resort saving each page as a TIF The problem Not ideal: Down-sampling of embedded images Lack of OCR Extra files and management for any future migration

Things get worse Only dealing with academic archives Negotiation was controlled (mostly!) Relatively small volume of PDFs deposited In the early 2000s the ADS (and partners in national and local government) launched the OASIS system

What is OASIS? Online system for recording investigations Aimed at development led works undertaken as mitigation in the planning process Over 4000+ events every year in England alone Each event has its own report, submitted to the local authority as part of preservation by record. http://dx.doi.org/10.3789/isqv25no3.2013.04

What is OASIS?

Offers a compact package Free to view The rise of the PDF Perceived as static content and layout will remain the same Traditionally stored on a Local Authority server or even printed out!

ai cdr csv doc dwg dxf jpg odt pdf png rtf tif xls xlsx 27802 Files accessioned through OASIS March 2008 July 2015

Can t you ask for non-pdf? Reports are policed by the relevant curator in Local Government, not the ADS These curators (understandably) prioritise getting the report and validity of content Local Government under severe financial pressures more important things to worry about! Besides, the PDF offers a simple solution for all these parties, why not use it?

Archival solutions (2008) PDF/A-1 aims to preserve the static visual Option 1: save everything as TIF appearance of electronic documents over time and also Option aims to 2: support do nothing future access and future migration needs by providing frameworks for embedding metadata about electronic documents, and defining the logical structure and semantic properties of electronic documents PDF/A standard Option3: Look at migrating all the PDF files we receive into the new PDF/A 1 standard

All conversions using Adobe Acrobat 8 - Pro 9 and 10 Manually using the Preflight tool At the beginning Almost impossible to migrate an archaeological report to the 1A standard PDF/A 1B was thus used as the default option

PDF to PDF/A 1B conversions were problematic, common issues included: Issues with XMP properties Glyphs Initial problems Fonts have not been embedded.

A lot of these can be fixed with manual intervention Dreaded Print to PDF/A 1B option Uncertainty as to what we re creating? Are we damaging images, losing text? However A lot of time spent in manual checks of PDF/A files to ensure significant properties retained

PDFTron Since mid-2011 the ADS utilised PDF/A Manager created by PDFTron batch-level processing; automation of previously manual fix-ups (such as the editing/stripping of XMP properties) Success rate i.e. the PDFTron conversion script returning a successful post-conversion validation is on average around 80% of all files processed (2013)

However PDFTron conversion rate was notably less successful when dealing with the growing numbers of PDF 1.5-1.7 (2015) Although a useful tool when dealing with a large corpus of mixed PDF formats, has not been the overarching solution Still using a mixture of PDFTron and Preflight

PDFTron PDF/A 1B files checked in Preflight (Pro X) On average between 5-10% have failed to conform to the profile However

Uncertainty over the PDF/A profile We returned to files created in Acrobat 8: significant numbers (c. 25%) of these were not validating as such in the Preflight tool for either Pro 9 or 10 a very small number created in Acrobat 9.5.5 that were not validating in Pro 10 On closer investigation it seems these issues have not been confined to the ADS experience. The Bavaria report identifies the same key problems, and attribute them to the fine-tuning of the PDF/A-1 standard throughout versions 7-9 of Adobe Software, as well as third party tools.

100% 90% 80% 70% 60% 50% 40% 30% 20% Microsoft Word Document Acrobat PDF/A Acrobat PDF 1.7 Acrobat PDF 1.6 Acrobat PDF 1.5 Acrobat PDF 1.4 Acrobat PDF 1.3 Acrobat PDF 1.2 Acrobat PDF 1.0 10% 0% 2008 2009 2010 2011 2012 2013 2014 2015

DROID (The National Archives) run on ingest Detects file type <pdfaid:part> namespace declaration in the PDF/A Identification ON subsequent checking (Preflight) most of these failing: false positives Much alarm! Identification of incoming PDF/A

CutePDF OmniPage Capture SDK Nitro PDF PDFCreator Acrobat PDFMaker for Word (v.8) Third party tools = the majority of PDF/A files we were receiving from depositors were not validating

Moving forward Open Planets Foundation (OPF) - PDF Identify, validate, repair. Hamburg 1-2 Sept 2014 Some errors in validation would not cause significant preservation risks in the future PDF/A 2: (ISO 19005-2). PDF/A 1 is too restrictive and doesn t deal well with aspects of PDF which have become available since PDF1.4

100% 90% 80% 70% 60% 50% 40% 30% 20% Microsoft Word Document Acrobat PDF/A Acrobat PDF 1.7 Acrobat PDF 1.6 Acrobat PDF 1.5 Acrobat PDF 1.4 Acrobat PDF 1.3 Acrobat PDF 1.2 Acrobat PDF 1.0 10% 0% 2008 2009 2010 2011 2012 2013 2014 2015

Use of PDF/A 1 and PDF/A 2 Current Practice Notable that some files will only convert to A1 not A2 (in Adobe) Adopting a best-fit policy using the two standards (yet to get to grips with A3)

ADS use of Callas PDF Toolbox Same validation tool as Adobe Acrobat X Pro and thus consistent validation Significant advantages in reporting Batch conversion Callas

Use of PDF/A 2 From Adobe Preflight Syntax problem: Real value out of range (too high) (also commonly get the other extreme - too low) Solution to this issue was to use PDFA2 as the original PDFA1b spec is unable to cope with large numbers either negative or positive. The error in this case was caused by a bezier curve with a high value.

Use of PDF/A 2 Preflight turned the pages into raster images and putting the text as an invisible layer so it was still searchable. This was not an ideal solution as it lost the vector aspects of some of the illustrations (archaeological site plans). As the vector aspects of these illustrations is often seen as a significant property of the file it is not adequate solution for this use case

Common problems Still an issue with Fonts: propriety font or unusual font size e.g. 11.00343pt Workaround for all softwares (inc. Callas) is to rasterize and image A solution, but not always best quality

Thoughts Do we (ADS) need to be less fussy about the significant properties for reports Does vector content always need to be retained? Should we worry about rasterizing pages (as long as OCR is present)? Now tied to a mixed economy of softwares and tools (some free, some commercial) to ensure consistent and accurate creation and validation.

Thanks for listening! tim.evans@york.ac.uk