Text Mining for Historical Documents: Digitisation and Preservation of Digital Data

Digitisation and Preservation of Digital Data
Computational Linguistics, Universität des Saarlandes
Winter Semester 2010/11, 21.02.2011

Digitisation

Why digitise?

Advantages of digitisation:
- safeguard against data loss (e.g., City Archive of Cologne)
- easier access (for experts)
  - safe access to fragile documents
  - worldwide
  - instant and continuous
  - simultaneous
- virtual exhibitions (for laypersons)
  - draw in new customers
    - increase visitor numbers (museum)
    - stimulate tourism (country)
  - showcase artefacts typically not on display (approx. 95% of a collection)
- (semi-)automatic data analysis

How to digitise?

For textual data:
1. digital photographs / scanning
2. optical character recognition (OCR) or manual transcription

Optical Character Recognition

In the ideal case: word error rates of around 1% on typed text (Lopresti, 2005).

However, historical documents often pose problems:
- bad print quality (stains, faded letters etc.)
- old-fashioned language (messes up language models)
- old-fashioned fonts
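To make the notion of word error rate concrete, the following is a minimal sketch that computes it as the word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the sample sentences are illustrative assumptions, not taken from the cited evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word deleted
                            curr[j - 1] + 1,      # extra word inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical ground truth and OCR output (two misread words out of seven).
reference = "the old mill stands by the river"
ocr_output = "the old rnill stands by tbe river"
print(f"WER: {word_error_rate(reference, ocr_output):.2%}")
```

Under this definition, a 1% word error rate corresponds to roughly one wrong word per hundred words of ground truth.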

Example: Fraktur
Source: http://www.suetterlinschrift.de/lese/schriftgeschichte/fraktur2.htm

OCR Problems

Typical errors:
- mixing up letter sequences that look similar (e.g., ii, ü, n, il, li)
- missing punctuation characters
- replacing letters by space or punctuation

This is problematic because
- it adversely affects retrieval results in keyword-based search
- it hampers further language processing (tokenisation, especially sentence splitting)

OCR errors are more difficult to correct than typos because
- the erroneous words often deviate more strongly from the correct word
- errors can affect sequences of letters with no one-to-one mapping (e.g., iii → m)
- errors can affect multiple positions in a word (e.g., e → c)
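The point that OCR errors deviate more strongly from the correct word than typos can be illustrated with a small edit distance calculation. This is a sketch only; the word pair and the assumed confusion between "m" and "iii" are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# A typical typo changes one character ...
typo_distance = edit_distance("time", "tine")
# ... whereas a single misread glyph ("m" seen as "iii") changes three.
ocr_distance = edit_distance("time", "tiiiie")
print(typo_distance, ocr_distance)  # prints: 1 3
```

A spell checker that only considers candidates within distance 1 or 2 would therefore miss the correct word for the OCR error, even though only one glyph was misrecognised.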

Correcting OCR Errors

Techniques used for spell checking don't work very well:
- Levenshtein distance / edit distance (often high for OCR errors)
- dictionary look-up (problematic for historical data)

Possible solutions:
- implement or learn heuristics (e.g., c → e)
- use tailor-made resources (e.g., historical dictionaries)
- combine predictions of different OCR systems (e.g., by voting) (see Volk et al., to appear)
- identify likely variants of a form and use statistics to detect errors (see Reynaert, 2008)
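Combining several OCR systems by voting could look roughly like the sketch below. It is not the method of the cited work: it assumes, for simplicity, that the outputs are already aligned token by token, and the function name `vote` and the three sample outputs are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def vote(outputs: Sequence[Sequence[str]]) -> List[str]:
    """Majority vote per token position over pre-aligned OCR outputs.

    Assumption: every system produced the same number of tokens, so
    position k in one output corresponds to position k in the others.
    """
    if len({len(tokens) for tokens in outputs}) != 1:
        raise ValueError("outputs must be pre-aligned to the same length")
    merged = []
    for candidates in zip(*outputs):
        # most_common(1) yields the most frequent candidate; ties go to
        # the candidate that was seen first (i.e. the earliest system).
        merged.append(Counter(candidates).most_common(1)[0][0])
    return merged


# Hypothetical outputs of three OCR systems for the same line.
system_a = "Die alte Mühle steht am Fluss".split()
system_b = "Die alte Miihle steht am Fluss".split()
system_c = "Die a1te Mühle steht am Fluss".split()
print(" ".join(vote([system_a, system_b, system_c])))
```

Each system makes a different mistake here, so the majority recovers the correct line. In practice the outputs would first have to be aligned, since OCR errors can merge or split tokens.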

Handwritten Documents

OCR infeasible:
- word error rates of 50% and more (Rath et al., 2004)
- for historical documents probably even worse

Solution:
- manual transcription
- partial manual transcription and search via visual similarity
- automatic alignment on paragraph-/sentence-level to simulate search in original document

Audio Data

Speech is similarly difficult to transcribe:
- automatic speech recognition challenging for open domains and multitude of speakers
- historical collections pose additional problems
  - quality of old recordings
  - speaker factors (emotion, non-native speech, old age)
  - possibly somewhat old-fashioned speech

Digital Data Preservation

Data Longevity

- Rosetta Stone: approx. 2200 years old (Source: http://en.wikipedia.org/wiki/file:rosetta_stone.jpg)
- Rhind Papyrus: approx. 3660 years (Source: http://en.wikipedia.org/wiki/file:rhind_mathematical_papyrus.jpg)
- Floppy Disk: approx. 25 years (Source: http://en.wikipedia.org/wiki/file:floppy_disk_2009_g1.jpg)

Digital Data

Typically very short life cycle:
- physical decay (bit rot): avg. life span of a CD-ROM is 5 years (Horlings, 2003)
- outdated media (e.g., floppy disk)
- outdated formats (e.g., word processing, WordPerfect)
- loss of context (e.g., lost decryption keys)

Several prominent cases of digital data loss:
- NASA's 1976 Viking Mars mission (stored on magnetic tape which deteriorated)
- BBC's Domesday Project, mid 1980s (outdated software, storage format)

Data Preservation (1)

Solution: very careful data management (not a solved problem yet)
- adherence to open standards such as XML (as opposed to proprietary formats)
- refreshing (i.e., periodically copying data to new storage media, which are ideally kept in different locations)
- migration (of data to newer storage formats)
- emulation (of outdated software on modern hardware)
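Refreshing only helps if the copy is known to be identical to the original. One common safeguard, sketched here as an assumption rather than something the slides prescribe, is to compare checksums after every copy. The helper names `sha256_of` and `refresh` and the file names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, target_dir: Path) -> Path:
    """Copy source into target_dir and verify the copy via its checksum."""
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)  # copies content and file metadata
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"checksum mismatch after copying {source}")
    return target


# Demonstration with temporary directories standing in for old and new media.
with tempfile.TemporaryDirectory() as tmp:
    old_file = Path(tmp) / "old_medium" / "scan_0001.xml"
    old_file.parent.mkdir()
    old_file.write_text("<page>example</page>", encoding="utf-8")
    new_file = refresh(old_file, Path(tmp) / "new_medium")
    print(new_file.name, sha256_of(new_file) == sha256_of(old_file))
```

Storing the digest alongside each file would also allow later checks for bit rot on the same medium, without needing the original for comparison.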

Data Preservation (2)

But challenging for most cultural heritage institutes:
- very expensive and time-consuming to implement
- amount of data
- heterogeneity of data, plethora of formats
  - databases
  - photos
  - scans
  - word processing documents
  - DOS, Windows, Mac, OS/2, Linux
- most institutes ignore the problem for the time being, no long-term data management