File Format Considerations in the Preservation of E-Books

Sheila Morrissey, Senior Research Developer, Portico
NISO Webinar: Heritage Lost? Ensuring the Preservation of E-books
May 23, 2012

Portico - Third Party Preservation
Portico is among the largest community-supported digital archives in the world. Working with libraries, publishers, and funders, we preserve e-journals, e-books, and other electronic scholarly content to ensure researchers and students will have access to it in the future.

Portico - Participating Content
Over 2,000 societies and associations have committed content to Portico through 147 publisher agreements.
Committed Content
  » E-journal titles: 13,675
  » E-book titles: 129,781
  » D-collections: 46

Portico - Preserved Content
Preserved Content
  » E-journal titles: 9,568
  » E-book titles: 16,861
  » D-collections: 12
  » Archival Units: 19,433,869
  » Preserved Files: 319,737,011

Portico - Audit and Certification
In 2010, Portico became the first digital preservation service to be independently audited by the Center for Research Libraries (CRL) and subsequently certified as a trusted, reliable digital preservation solution that serves the needs of the library community.

Portico - History
2002  Launch of Electronic Archiving Initiative by JSTOR
2005  Portico launched
2006  Portico ingests initial e-journal content into the archive
2007  Portico makes first trigger title available
2009  Portico ingests initial e-book content into the archive
2009  CRL audit of Portico begins
2009  Portico fulfills first PCA claim
2010  Portico ingests initial d-collection content

Digital Preservation
Digital preservation is the series of management policies and activities necessary to ensure the enduring usability, authenticity, discoverability, and accessibility of content over the very long term.
The key goals of digital preservation include:
  » Usability: the intellectual content of the item must remain usable via the delivery mechanism of current technology
  » Authenticity: the provenance of the content must be proven and the content an authentic replica of the original
  » Discoverability: the content must have logical bibliographic metadata so that it can be found by end users through time
  » Accessibility: the content must be available for use to the appropriate community

Preservation: Legal aspects
Legal right to preserve content
  » Not always the same as access rights
  » Specified in contracts
  » Includes embedded or supplemental files, such as images
  » DRM removed

Usability - Preserve Intellectual Content

Usability: Rendition and Delivery
Content is rendered to support the current delivery platform, i.e. a web browser.
The rendition engine can be modified to meet new technology requirements.

Portico - Another Look at the History
2002  Launch of Electronic Archiving Initiative by JSTOR
2005  Portico launched
2006  Portico ingests initial e-journal content into the archive
2007  Portico makes first trigger title available; iPhone, Kindle 1
2009  Portico ingests initial e-book content; Kindle 2, Nook
2010  Portico ingests initial d-collection content; iPad 1, Nook Color
2011  iPad 2, Kindle Fire, Nook Simple Touch, EPUB 3
2012  iPad 3

Usability: Anticipated usage

Usability: and new usage

Authenticity, Discoverability: Preservation Context
Preservation and Packaging Metadata File

Context: Preservation and Packaging Metadata File
The intellectual unit represented by this metadata file is a digitized book. It was scanned by Joe on this date. It was ingested into the repository on this other date. Jane Smith granted us preservation rights to it on this other date....
These TIF files are page images. The TIF file named XYZ is page 1. It is a valid TIF and has a checksum of 123456. The TIF file named ABC is page 2. It is not a valid TIF and has a checksum of 78910....
These JPG files are figures. The JPG file named MNO is the 2nd figure on page 2. It is a valid JPG and has a checksum of 234567....
This PDF file contains page images. The page images are built from TIF files XYZ, ABC, etc. and JPG figure graphics MNO, etc....
This MARC file is the bibliographic record for the book....
This XML file contains the full text of the book. It uses the QRS DTD. It is named JKL and has a checksum of 555555....
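
The per-file statements in such a metadata file (what each file is, whether it is valid, what its checksum is) can be sketched in a few lines of Python. This is only an illustration: the record layout, the `describe_file` helper, the role labels, and the choice of SHA-256 are assumptions, not Portico's actual metadata schema, and the validity test only looks at the leading magic bytes rather than validating the whole format.

```python
import hashlib

# Leading "magic" bytes for a few formats; a real validator checks far more.
MAGIC = {
    "TIF": (b"II*\x00", b"MM\x00*"),
    "JPG": (b"\xff\xd8\xff",),
    "PDF": (b"%PDF-",),
}

def describe_file(name, data, fmt, role):
    """Return one metadata record for a file (hypothetical layout)."""
    return {
        "name": name,
        "format": fmt,
        "role": role,
        "valid": data.startswith(MAGIC[fmt]),
        "checksum": hashlib.sha256(data).hexdigest(),
    }

# Stand-in byte strings; in practice these would be read from the files.
records = [
    describe_file("XYZ", b"II*\x00" + b"\x00" * 16, "TIF", "page 1"),
    describe_file("ABC", b"not a tiff", "TIF", "page 2"),
]

for r in records:
    state = "a valid" if r["valid"] else "not a valid"
    print(f"The {r['format']} file named {r['name']} is {r['role']}. "
          f"It is {state} {r['format']} and has a checksum of {r['checksum'][:12]}...")
```

Recording the checksum at ingest is what later allows a repository to show that a file has not changed, which supports the authenticity goal.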

Formats: Packages
Incoming File System
PublisherA
  0008543x
    2006
      106
        B
          CNCR21779
            21779_ftp.pdf
            21779_ftp.sgm
            equation
              aueq001.tif
              aueq002.tif
              nueq001.gif
              nueq002.gif
            image_m
              mfig001.jpg
            image_n
              nfig001.jpg
            image_t
              tfig001.gif

Formats: Packages
Resulting Content Model
Content Unit (Article)
  » Text: Marked Up Text
      21779_ftp.sgm
  » Rendition: Page Images
      21779_ftp.pdf
  » Component: Formula Graphic
      aueq001.tif
      nueq001.gif
  » Component: Formula Graphic
      aueq002.tif
      nueq002.gif
  » Component: Figure Graphic
      mfig001.jpg
      nfig001.jpg
      tfig001.gif
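
The mapping from an incoming file system to a content model can be sketched as a small classification step. The rules below (file extension for text and rendition, an `eq`/`fig` naming pattern plus a sequence number for graphics) are assumptions inferred from the file names on the slide, as is the directory nesting of the sample paths; they are not Portico's actual ingest logic.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Sample paths relative to the article directory (nesting is assumed).
INCOMING = [
    "21779_ftp.pdf", "21779_ftp.sgm",
    "equation/aueq001.tif", "equation/aueq002.tif",
    "equation/nueq001.gif", "equation/nueq002.gif",
    "image_m/mfig001.jpg", "image_n/nfig001.jpg", "image_t/tfig001.gif",
]

def build_content_model(paths):
    """Group incoming files into text, rendition, and graphic components."""
    model = {"Text: Marked Up Text": [], "Rendition: Page Images": []}
    components = defaultdict(list)
    for p in paths:
        path = PurePosixPath(p)
        if path.suffix == ".sgm":
            model["Text: Marked Up Text"].append(path.name)
        elif path.suffix == ".pdf":
            model["Rendition: Page Images"].append(path.name)
        else:
            # Hypothetical naming rule: <prefix>eq<NNN> or <prefix>fig<NNN>.
            m = re.fullmatch(r"[a-z]+?(eq|fig)(\d+)", path.stem)
            if m:
                kind = "Formula Graphic" if m.group(1) == "eq" else "Figure Graphic"
                # Files sharing a kind and number are variants of one component.
                components[(kind, m.group(2))].append(path.name)
    model["Components"] = [(kind, files) for (kind, _), files in components.items()]
    return model

model = build_content_model(INCOMING)
for kind, files in model["Components"]:
    print(f"Component: {kind} -> {', '.join(files)}")
```

The point of the grouping is that several files in different formats can be variants of the same intellectual component, which a flat file listing does not express.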

E-Book Packages in Portico Submissions
Flat directory
  » ONIX XML file with bibliographic metadata
  » One PDF file per book
  » Front cover image JPG files

E-Book Packages in Portico Submissions
TAR file (multiple books per file)
  » XML manifest file
  » One directory for each book
      Proprietary XML file (3 possible versions of XML) with bibliographic metadata
      Subdirectory with files for front matter chapters (XML, PDF, OCR of PDF)
      Subdirectory with files for regular chapters (XML, PDF, OCR of PDF)
      Subdirectory with files for back matter chapters (XML, PDF, OCR of PDF)
      Subdirectory with TIFF file for cover image of book

E-Book Packages in Portico Submissions
ZIP file (sometimes one book per file, sometimes multiple books)
  » Sometimes flat (all books at one level)
  » Sometimes one directory for each book
      Sometimes cover images (JPG or TIFF)
      Sometimes one PDF for entire book in addition to PDF for each chapter
  » Sometimes a manifest
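
Because every "sometimes" above is a case an ingest workflow has to detect, a first pass over a ZIP submission might simply summarize what it finds. The following is a minimal sketch using Python's standard `zipfile` module; the `summarize_package` helper, its heuristics (any top-level directory means "directory per book", any file with "manifest" in its name is a manifest), and the sample member names are all hypothetical.

```python
import io
import zipfile
from pathlib import PurePosixPath

IMAGE_EXTS = (".jpg", ".jpeg", ".tif", ".tiff")

def summarize_package(zf):
    """Report the layout features of a ZIP submission (heuristic only)."""
    names = [n for n in zf.namelist() if not n.endswith("/")]
    top_dirs = {PurePosixPath(n).parts[0] for n in names
                if len(PurePosixPath(n).parts) > 1}
    return {
        "layout": "directory per book" if top_dirs else "flat",
        "book_dirs": sorted(top_dirs),
        "pdf_count": sum(n.lower().endswith(".pdf") for n in names),
        "has_cover_image": any(n.lower().endswith(IMAGE_EXTS) for n in names),
        "has_manifest": any("manifest" in PurePosixPath(n).name.lower()
                            for n in names),
    }

def make_zip(members):
    """Build an in-memory ZIP with empty members, for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in members:
            zf.writestr(name, b"")
    buf.seek(0)
    return zipfile.ZipFile(buf)

flat = summarize_package(make_zip(["book1.pdf", "book2.pdf", "book1.jpg"]))
nested = summarize_package(make_zip(
    ["manifest.xml", "book1/book1.pdf", "book1/ch01.pdf", "book1/cover.tif"]))
print(flat)
print(nested)
```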

Formats: Text Content
Hello, World!!

Formats: Text Content

    BT
    /H2 <</MCID 0 >>BDC
    /CS0 cs 0.31 0.506 0.741 scn
    /TT0 1 Tf
    -0.004 Tc 0.006 Tw 12.96 0 0 12.96 72 697.68 Tm
    [(H)-4(e)-1(l)-1(l)-11(o,)-3( W)-15(or)-6(l)-11(d!)-12(!)]TJ
    0 Tc 0 Tw 6.481 0 Td
    ( )Tj
    EMC
    ET

Hello, World!!

Formats: Text Content

    <html>
      <head>
        <style type="text/css">
          <!--
          p { color: #4F81BD; font-family: serif; font-weight: bold; font-size: 13pt; }
          -->
        </style>
      </head>
      <body><p>Hello, World!!</p></body>
    </html>

Hello, World!!

Trade-offs: Expressiveness vs. Simplicity
Hello, World!!
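
One way to see the trade-off is to try to recover the words "Hello, World!!" from each of the two encodings shown in the preceding slides. The sketch below is illustrative only: the regular expression handles just the simple `TJ` array from the slide (no escaped parentheses, no font encodings, no compressed streams), and the `text_from_tj` and `text_from_html` helpers are hypothetical names. The HTML sample is a shortened form of the slide's markup.

```python
import re
from html.parser import HTMLParser

PDF_TJ = "[(H)-4(e)-1(l)-1(l)-11(o,)-3( W)-15(or)-6(l)-11(d!)-12(!)]TJ"
HTML_DOC = ('<html><head><style type="text/css">p { color: #4F81BD; }</style>'
            '</head><body><p>Hello, World!!</p></body></html>')

def text_from_tj(operand):
    """Join the string fragments of a TJ array, dropping the kerning numbers."""
    return "".join(re.findall(r"\(([^()\\]*)\)", operand))

class BodyText(HTMLParser):
    """Collect character data, skipping the contents of <style>."""
    def __init__(self):
        super().__init__()
        self.in_style = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        if not self.in_style:
            self.chunks.append(data)

def text_from_html(doc):
    parser = BodyText()
    parser.feed(doc)
    parser.close()
    return "".join(parser.chunks).strip()

print(text_from_tj(PDF_TJ))
print(text_from_html(HTML_DOC))
```

In the PDF case the text is interleaved with positioning data and has to be reassembled from fragments; in the HTML case it sits in the markup as a single run, with presentation kept in the style sheet.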

Formats: Rich Content
Hello, World!!

Formats: Rich Content

    BT
    /H2 <</MCID 0 >>BDC
    /CS0 cs 0.31 0.506 0.741 scn
    /TT0 1 Tf
    -0.004 Tc 0.006 Tw 12.96 0 0 12.96 264 697.68 Tm
    [(H)-4(e)-1(l)-2(l)-11(o,)-3( W)-15(or)-6(l)-11(d!)-12(!)]TJ
    0 Tc 0 Tw 6.481 0 Td
    ( )Tj
    EMC
    /P <</MCID 1 >>BDC
    /CS1 cs 0 scn
    /TT1 1 Tf
    11.04 0 0 11.04 72 682.08 Tm
    ( )Tj
    EMC
    /P <</MCID 2 >>BDC
    36.478 -24.185 Td
    ( )Tj
    EMC
    ET
    /Figure <</MCID 3 >>BDC
    q
    /GS0 gs
    336 0 0 252 139.1000061 414.6812744 cm
    /Im0 Do
    Q
    EMC

Hello, World!!

Formats: Rich Content
Hello, World!! (iText RUPS)

Formats: Rich Content

    <html>
      <head>
        <style type="text/css">
          <!--
          p { color: #4F81BD; font-family: serif; font-weight: bold; font-size: 13pt; }
          -->
        </style>
      </head>
      <body><p>Hello, World!! <br/><span><img width="447" height="336" src="images/image_001.jpg"/></span></p></body>
    </html>

Hello, World!!

Trade-offs: Encapsulation vs. Articulation

mydir/
    myfile.pdf

mydir/
    myfile.html
    images/
        Image01.jpg
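
An articulated package like the HTML one is only complete if every file it points to travels with it, so a preservation workflow would plausibly check references against the files actually received. A minimal sketch, assuming image references are relative to the HTML file's own directory; the `missing_assets` helper and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class ImageRefs(HTMLParser):
    """Collect the src attribute of every <img> element."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)

def missing_assets(html, package_files):
    """Return referenced images that are absent from the package."""
    parser = ImageRefs()
    parser.feed(html)
    parser.close()
    return [s for s in parser.srcs if s not in package_files]

HTML = ('<p>Hello, World!!<br/>'
        '<img width="447" height="336" src="images/Image01.jpg"/></p>')

complete = missing_assets(HTML, {"myfile.html", "images/Image01.jpg"})
incomplete = missing_assets(HTML, {"myfile.html"})
print(complete)
print(incomplete)
```

The encapsulated PDF needs no such check, since the image is embedded in the single file, but its components are correspondingly harder to address individually.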

E-book formats in Portico Submissions
PDF
  » One file per chapter
  » One file per book
TIFF
  » One file per page
JPEG
  » One file per page
XML
  » For bibliographic metadata
  » Proprietary
  » ONIX variants
  » NLM variants

Looking ahead: EPUB 3
EPUB 3 (http://idpf.org/epub/30)
  » EPUB defines a means of representing, packaging and encoding structured and semantically enhanced Web content, including HTML5, CSS, SVG, images, and other resources, for distribution in a single-file format.

Looking ahead: EPUB 3
EPUB 3
  » Web standards for key component technologies
  » Free and open specification
  » Must work in at least some appliance
Outside publisher's own workflow

EPUB3 Packaging
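
At the container level, an EPUB is a ZIP archive whose first entry is an uncompressed file named `mimetype` containing `application/epub+zip`, with a `META-INF/container.xml` file pointing to the package document. The sketch below writes and checks only that skeleton with Python's standard `zipfile` module. It does not produce a valid EPUB (there is no package document or content), and the `EPUB/package.opf` path and helper names are placeholders chosen for illustration.

```python
import io
import zipfile

CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="EPUB/package.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

def write_skeleton(buf):
    """Write the container skeleton only; not a complete publication."""
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry comes first and is stored without compression.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML,
                    compress_type=zipfile.ZIP_DEFLATED)

def check_container(buf):
    """Check the basic container conventions of an EPUB-style ZIP."""
    with zipfile.ZipFile(buf) as zf:
        first = zf.infolist()[0]
        return {
            "mimetype_first": first.filename == "mimetype",
            "mimetype_stored": first.compress_type == zipfile.ZIP_STORED,
            "mimetype_value_ok": zf.read("mimetype") == b"application/epub+zip",
            "has_container_xml": "META-INF/container.xml" in zf.namelist(),
        }

buf = io.BytesIO()
write_skeleton(buf)
buf.seek(0)
result = check_container(buf)
print(result)
```

Checks like these are cheap to run at ingest, but they say nothing about whether the content inside conforms to the format profiles the specification requires.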

EPUB3 Formats
Profiles of standard formats for authoring content
  » XHTML5, SVG 1.1, CSS 2.1, CSS 3
  » Constraints (extensions to HTML5, constraints on SVG)
  » Specs a moving target
Conforming readers must support rendition of certain formats
  » Image, audio, video
  » Defined fallbacks
Globalization, Encoding, Fonts

Complications: The New Browser Wars
Amazon
  » Announces it is replacing MOBI with K8
iBooks
  » Different mimetype
  » Proprietary extension of CSS Media Queries
  » Proprietary XML namespace
  » Etc.

Complications: "More What You'd Call Guidelines Than Actual Rules"
Pirates of the Caribbean: The Curse of the Black Pearl. The Walt Disney Company (2003)

Questions or Comments?
Sheila Morrissey
sheila.morrissey@ithaka.org
@sheilamorr
www.portico.org