Characterisation. Digital Preservation Planning: Principles, Examples and the Future with Planets. July 29 th, 2008

Similar documents
Recording oral histories

RECOMMENDED FILE FORMATS

Different File Types and their Use

MEDIA RELATED FILE TYPES

WakeSpace Digital Archive Policies

Basics in good research data management (RDM) for reviewing DMPs

Website Overview. Your Disclaimer Here. 1 Website Overview

Multimedia. File formats. Image file formats. CSE 190 M (Web Programming) Spring 2008 University of Washington

Client Website Overview Guide

Wealth Management Center Overview Guide

Part III: Survey of Internet technologies

Lecture 19 Media Formats

File Upload extension User Manual

Uploading a File in the Desire2Learn Content Area

Introduction to Content

HTM, HTML, MHT, MHTML Web document Brightspace Learning Environment strips the <title> tag and text within the tag from user created web documents

ExtremeTech Technology News - FTP Site Statistics. Top 20 Directories Sorted by Disk Space

Advanced High Graphics

Elementary Computing CSC 100. M. Cheng, Computer Science

National Aeronautics and Space Admin. - FTP Site Statistics. Top 20 Directories Sorted by Disk Space

Standard File Formats

How to submit an assignment to Turnitin full student guide

EXCELLENT ACADEMY OF ENGINEERING. Telephone: /

Adding Modules.. 4 Editing a Rich Text Module Publishing a Module Adding Media (Picture, Audio, Video, and PDF) Adding Media from the web (Videos)

Example 1: Denary = 1. Answer: Binary = (1 * 1) = 1. Example 2: Denary = 3. Answer: Binary = (1 * 1) + (2 * 1) = 3

Supported File Types

ednet. smart memory Smart storage expansion for your iphone or ipad

What s New in Studio 6?

Inserting multimedia objects in Dreamweaver

Graphic Communications

The Workers Advisers Office (WAO)

Assigns a persistent identifier that will always point to the object and/or its metadata.

UiT Open Research Dataset Guidelines

The New Document Digital Polymorphic Ubiquitous Actionable Patrick P. Bergmans University of Ghent

Funcom Multiplayer Online Games - FTP Site Statistics. Top 20 Directories Sorted by Disk Space

GUIDELINES FOR CREATION AND PRESERVATION OF DIGITAL FILES

Instruction Manual. idiskk USB Flash Drive 32GB/64GB/128GB

NAS 256 DataSync for Microsoft OneDrive. Backup data from Microsoft OneDrive to your NAS

Multimedia on the Web

OnDemand Discovery Quickstart Guide


3.01C Multimedia Elements and Guidelines Explore multimedia systems, elements and presentations.

Digitisation Standards

Converting.odt (Open Office Document) to (MS Word).doc in console / terminal on Linux and FreeBSD

Genesis Webinar-To-Go Quick Reference Guide

DOWNLOAD OR READ : CONVERTING WORD DOCUMENT TO FORM PDF EBOOK EPUB MOBI

Structure and document your research data

Formatting Support: Word 2008

CR-8710 ios SD Card Reader

Sustainable File Formats for Electronic Records A Guide for Government Agencies

Best Practice: Attaching Files

PhotoFast MemoriesCable U2. Market leading design and technology

HTML5: MULTIMEDIA. Multimedia. Multimedia Formats. Common Video Formats

Standard Exponential Creative Specifications

Atari Games - FTP Site Statistics. Top 20 Directories Sorted by Disk Space

MULTIMEDIA AND CODING

Draft Digital Preservation Policy for IGNCA. Dr. Aditya Tripathi Banaras Hindu University Varanasi

UNDERSTANDING MUSIC & VIDEO FORMATS

Image coding and compression

National Digital Heritage Archive Programme

Publication Quality Graphics

Transform Presentation

Quick start guide to Blackboard at Keele

The content editor has two view modes: simple mode and advanced mode. Change the view in the upper-right corner of the content editor.

STANFORD U.HyperRESEARCH Workshop

Multimedia applications

HTML 5 and CSS 3, Illustrated Complete. Unit K: Incorporating Video and Audio

Metamorphosis IT Document

Content 8.3 to 8.4.x. User Guide Fifth edition, February 22, by Desire2Learn Incorporated. All rights reserved

QuestBase. Create, manage, analyze assessments, tests, quizzes, exams and surveys. Getting Started

Submission Checklist. 1. Have you read the Assignment Brief? 2. Is the file in the correct format? 3. Should you be submitting anonymously?

Computing in the Modern World

DOWNLOAD OR READ : FREE SERVICE MANUAL 2006 GMC SIERRA PDF EBOOK EPUB MOBI

Tablet 300x x x x1024

Assignment Submission - Student Work Product

Print Services User Guide

Final Study Guide Arts & Communications

PASIG Digital Preservation BOOTcamp Best* Practices in Preserving Common Content Types: AudioVisual files

Operator Manual. Version 2.813

CPSC 301: Computing in the Life Sciences Lecture Notes 16: Data Representation

MULTIMEDIA DESIGNING AND AUTHORING

SCHOOL OF DISTANCE EDUCATION UNIVERSITY OF CALICUT SCHOOL OF DISTANCE EDUCATION D M A INTRODUCTION TO MULTIMEDIA QUESTION BANK

VidOvation TV Digital Signage Overview

File Formats for Digital Preservation

Kingston Technology Customization Program

Activity 1: Activity 2: Activity 3:

Instructional Overview. Module 1: Technology Concepts - Part 3: Technology Terms

The M word: What works and what doesn t in file format migration

CTIS 155 Information Technologies I. Chapter 5 Application Software: Tools for Productivity

IDM 221. Web Design I. IDM 221: Web Authoring I 1

ArTe. Letizia Jaccheri Experts in Team 11 th January 2010

My Media Hub Quick Start Guide for USB Devices. Sharing media content with the Fetch Box from a USB device

Camtasia Studio 5.0 PART I. The Basics

ishowdrive (WIB5012) User Manual

Question No: 2 Which part of the structured FrameMaker application controls how long SGML and FrameMaker element names can be by default?

vdx Exponential Creative Specifications

How to apply: The online application process explained step by step

Computing: Digital Media Elements for Applications (SCQF level 5)

BOCC NUT Content Guide

INFSCI 2140 Information Storage and Retrieval Lecture 1: Introduction. INFSCI 2140 and your program

Transcription:

Characterisation Digital Preservation Planning: Principles, Examples and the Future with Planets. July 29 th, 2008 Manfred Thaller Universität zu * Köln manfred.thaller@uni-koeln.de * University at, NOT of Cologne

I What is (in) a format? 2

An image 3

An image 6 rows 5 columns 4

An image 5 rows 6 columns 5

An image 1 == blue 0 == red 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 6

An image 1 == green 0 == yellow 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 7

Store: 1,1,1,1,1,1, 0,0,0,1,1,1, 0,1,1,1,1,0, 1,1,1,1,0,1, 1,1,1,1,1,1 Uncompressed An image 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 8

Store: 6,1,3,0,3,1, 1,0,4,1,1,0, 4,1,1,0,7,1 (Compressed) Run Length Encoded An image 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 9

dimensions An image photogrammetric interpretation compression 10

<basic information> An image <rendering information> <storage information> <data> 11

File format <basic information> What to do? <rendering information> How to do it? <storage information> How to move it from persistent to deployed form? <data> What to deploy? 12

File format <basic information> What to do? <rendering information> How to do it? <storage information> How to move it from persistent to deployed form? <data> What to deploy? 13

<basic information> Mandatory <rendering information> Useful <storage information> Historical <data> Mandatory File format 14

File format A deterministic specification how the properties of a digital object can reversibly be converted into a linear bytestream (bitstream). 15

II Why would we want to know? 16

III Which format to choose? 17

Recommended formats: text High confidence Medium confidence Low confidence Plain text (encoding: ISO8859-1 - 9, UTF-8, UTF-16 with BOM) XML (includes XSD/XSL/XHTML, etc.; with included or accessible schema and character encoding explicitly specified) PDF/A-1 (ISO 19005-1) Cascading Style Sheets (*.css) DTD (*.dtd) PDF (*.pdf) (embedded fonts) Rich Text Format 1.x (*.rtf) HTML 4.x (include a DOCTYPE declaration) SGML (*.sgml) Open Office (*.sxw/*.odt) Office Open XML (*.docx) PDF (*.pdf) (encrypted) Microsoft Word (*.doc) WordPerfect (*.wpd) DVI (*.dvi) All other text formats not listed here http://www.fcla.edu/digitalarchive/pdfs/recformats.pdf 18

Recommended formats: bitmap / raster image High confidence Medium confidence Low confidence TIFF (uncompressed) PNG (*.png) BMP (*.bmp) JPEG/JFIF (*.jpg) JPEG2000 (prefer lossless or uncompressed) (*.jp2) TIFF (compressed) GIF (*.gif) MrSID (*.sid) TIFF (in Planar format) FlashPix (*.fpx) PhotoShop (*.psd) All other raster image formats not listed here http://www.fcla.edu/digitalarchive/pdfs/recformats.pdf 19

Recommended formats: vector graphics High confidence Medium confidence Low confidence SVG 1.1 (no Java binding) (*.svg) Computer Graphic Metafile (CGM, WebCGM) (*.cgm) Encapsulated Postscript (EPS) Macromedia Flash (*.swf) All other vector image formats not listed here http://www.fcla.edu/digitalarchive/pdfs/recformats.pdf 20

Recommended formats: audio High confidence Medium confidence Low confidence AIFF (PCM) (*.aif, *.aiff) WAV (PCM) (*.wav) SUN Audio (uncompressed) (*.au) Standard MIDI (*.mid, *.midi) Ogg Vorbis (*.ogg) Free Lossless Audio Codec (*.flac) Advance Audio Coding (*.mp4, *.m4a, *.aac) MP3 (MPEG-1/2, Layer 3)(*.mp3) http://www.fcla.edu/digitalarchive/pdfs/recformats.pdf AIFC (compressed) (*.aifc) NeXT SND (*.snd) RealNetworks 'Real Audio (*.ra, *.rm, *.ram) Windows Media Audio (*.wma) WAV (compressed) (*.wav) All other audio formats not listed here 21

Recommended formats: video High confidence Medium confidence Low confidence Motion JPEG 2000 (ISO/IEC 15444-4)( *.mj2) AVI (uncompressed) (*.avi) QuickTime Movie (uncompressed)(*.mov) Motion JPEG (*.avi, *.mov) Ogg Theora (*.ogg) MPEG-1, MPEG-2 (*.mpg, *.mpeg) MPEG-4(*.mp4) AVI (compressed) (*.avi) QuickTime Movie (compressed) (*.mov) RealNetworks 'Real Video (*.rv) Windows Media Video (*.wmv) All other video formats not listed here http://www.fcla.edu/digitalarchive/pdfs/recformats.pdf 22

Recommended formats: data base High confidence Medium confidence Low confidence Delimited Text (*.txt, *.csv) SQL DDL DBF (*.dbf) OpenOffice *.sxc/*.ods) Office Open XML *.xlsx) Excel (*.xls) All other spreadsheet/ database formats not listed here http://www.fcla.edu/digitalarchive/pdfs/recformats.pdf 23

Recommended formats: 3D ( virtual reality ) High confidence Medium confidence Low confidence X3D (*.x3d) VRML (*.wrl, *.vrml) U3D (Universal 3D file format) All other virtual reality formats not listed here http://www.fcla.edu/digitalarchive/pdfs/recformats.pdf 24

Doctoral thesis on robustness of file formats: Volker Heydegger, University at Cologne. herrmanv@uni-koeln.de 25

IV How to we identify a format? 26

What kind of file is this? Two ways to identify a file: (a)by extension. Each file ending with *.doc is a MS Word document 27

What kind of file is this? Two ways to identify a file: (b) By internal characteristics ( magic number, signature ). A TIFF file begins with Bytes 0-1: The byte order used within the file. Legal values are: II (4949.H) / MM (4D4D.H) Bytes 2-3 An arbitrary but carefully chosen number (42) that further identifies the file as a TIFF file. 28

File format registries - URLs PRONOM: http://www.nationalarchives.gov.uk/pronom/ (does not only rely on extensions) Global Digital Format Registry: http://hul.harvard.edu/gdfr (predominantly project description) FileExt: http://filext.com (predominantly links to software) 29

V What s a file characteristic, than? 30

Technical metadata A high proportion of the preservation metadata will be in narrative format and will require manual entry by Library staff. A significant subset of the data however, relating to technical file characteristics, can be automatically extracted from the digital object by reading the file header details. This successful extraction of preservation metadata has been proved in a previous National Library proof of concept project. The automated capture of this information will significantly reduce the amount of manual data entry required from Library staff. file characteristics. 31

Why automate? 1 million objects: use one second for each. == 16666.7 minutes == 277.8 hours == 11.57 working days of a computer == 34.7 8-hour days for a Human == 7 working weeks 32

Why automate? 1 million objects: use five minutes for each. == 416 666.7 hours == 52 803.4 8-hour days for a Human == way too much for anything 33

Formats in PLANETS: File characteristics Based on two formal languages: (1)eXtensible Characterisation Extraction Language (= XCEL) (2)eXtensible Characterisation Description Language (= XCDL) 34

The comparator tiff Extractor tiff XCDL Migrator 93% Comparator png tiff XCEL png XCEL png XCDL 35

The comparator Extractor Appropriate XCELs Comparator C-Set 36

Why data? Photoshop Photoshop Becomes discoverable only from the actual data 37

V What is not in a file format? 38

Testfile in Word 2007 39

Testfile in Word 2003 (2007) 40

Testfile in Open Office ODT 41

Testfile in PDF 42

Measuring the pages Cut out page from rendering surface. Scale to common dimensions: 371 +/- 1 x 521 +/- 1 Measure 1. The leftmost and lowest completely black pixel in the letter A starting the first line of the main text. 2. The leftmost and highest completely black pixel in the letter E starting the first line of the text in the footnote. 3. The geometrical centre of the period at the end of the main sentence. 4. The geometrical centre of the period at the end of the footnote text. 43

Measuring Word 2003 (i) = 45 / 134; (ii) = 57 / 470; (iii) = 215 / 322 ; (iv) = 254 / 483 44

Measuring Word 2007 (i) = 45 / 134; (ii) = 57 / 470; (iii) = 215 / 322 ; (iv) = 254 / 483 45

Open Office ODT (i) = 44 / 132; (ii) = 52 / 469; (iii) = 214 / 320 ; (iv) = 247 / 482 46

PDF (i) = 45 / 130; (ii) = 59 / 467; (iii) = 215 / 317 ; (iv) = 254 / 480 47

Summary I The comparison of the four renderings of the example pages described above seem to indicate clearly, that a migration from the Word family of formats to PDF is a better way to preserve the content of the document, than a migration to the Open Office format. 48

Measuring Word 2003 Relationship tagged explicitly. Text / footnote separation clear. Rendering / layout not (totally) predicatble. Footnote indicator unpredictable. 49

Measuring Word 2007 Relationship tagged explicitly. Text / footnote separation extremely clear. Rendering / layout pretty predictable. Footnote indicator not predictable. 50

Open Office ODT Relationship tagged explicitly. Text / footnote separation extremely clear. Rendering / layout a little bit predictable. Footnote indicator predictable. 51

PDF Relationship expressed by layout. Text / footnote separation missing. Rendering / layout very much predictable. Footnote indicator predictable. 52

Summary II The comparison of the four internal structures of the example pages described above seem to indicate clearly, that a migration from the Word family of formats to PDF is a worse way to preserve the content of the document, than a migration to the Open Office format. 53

Small technical note Do not forget, that the whole movement started by SGML, carried into the WWW by HTML, transferred to content by the TEI and started XML as a basic empowering technology...... assumes that rendering is NOT particularly relevant. 54

Proposal <significantpoints> <point x= 45 y= 134 /> <point x= 57 y= 470 /> <point x= 215 y= 322 /> <point x= 254 y= 483 /> </significantpoints> 55