TEI, METS and ALTO: why we need all of them
Günter Mühlberger, University of Innsbruck, Digitisation and Digital Preservation


Agenda
- Introduction
- Problem statement
- Proposed solution

Starting point: mass digitisation is taking place!
- Case 1: University of Innsbruck. Digitisation of 216,000 German dissertations from 1925 to 1988: 24 million pages.
- Case 2: EU Newspaper Project. 8 million pages for Optical Character Recognition (OCR), 2 million pages for structural enhancement (article tracking); a factor of 5-10 compared to books.
- Case 3: Austrian National Library and Bavarian State Library. Probably 50-70% of all German-language books from 1500 to 1870 digitised by Google: 1.3 million books, 500 million pages.
- Many more projects and programmes are under way.
- Handwritten Text Recognition (HTR) will complete the picture; it will take off in 5-10 years.
- Mass sources will become available through automated processes.

Objectives
- Make these highly important sources available to the CLARIN community.
- Transform these sources into TEI in a meaningful and highly standardised way, in order to support reusability and sustainability.

Problem statement: how should this transformation be done?
- The TEI community usually prepares TEI files manually. E.g. Deutsches Textarchiv (DTA): instructions for generating (simplified) TEI-based corpus material; automation is used to assist the manual preparation of the files.
- BUT, a simple calculation: correcting 1,000 characters usually costs about 0.50 to 1 EUR (off-shore prices). A book page has about 2,500 characters, a journal page about 5,000-8,000 characters, and a newspaper page 5,000-30,000 characters or even more (see the sketch below).
- Manual correction would cost at least 100 times more than the whole digitisation; no one will ever pay for that.
- Conclusion: manual preparation is not feasible for digitised mass sources; we need to find another way!
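To make the cost argument concrete, here is a back-of-the-envelope sketch in Python. The per-page character counts come from the slide; the 0.75 EUR midpoint and the per-page digitisation cost are assumptions for illustration only.

```python
# Back-of-the-envelope correction costs, using the figures from the slide.
# Assumption: 0.75 EUR per 1,000 corrected characters (midpoint of 0.50-1 EUR).
COST_PER_1000_CHARS = 0.75  # EUR, off-shore price

CHARS_PER_PAGE = {
    "book page": 2500,
    "journal page": 6500,     # midpoint of 5,000-8,000
    "newspaper page": 17500,  # midpoint of 5,000-30,000
}

for kind, chars in CHARS_PER_PAGE.items():
    cost = chars / 1000 * COST_PER_1000_CHARS
    print(f"{kind}: ~{cost:.2f} EUR for full manual correction")

# Mass digitisation itself is often quoted at only a few cents per page
# (an assumption here), which is where "at least 100 times more" comes from.
```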

Availability of text and structure
- Mass digitisation of text: the results of Optical Character Recognition (OCR) are text and segmentation information (the location of blocks, lines, words and characters on the page image), plus the discrimination of text vs. graphical blocks vs. tables vs. noise.
- Enhanced processes, e.g. rule-based or machine-learning approaches: structural data such as paragraphs as logical units, chapters, running titles, footnotes, depending on the type of document.
- Example of the EU Newspaper Project: 8 million pages of OCR, 2 million pages of basic structural segmentation. Technical metadata (file type, file size, MD5 checksum, etc.) are generated automatically with the File Analyzer Tool (FAT). OCR and OLR transformation; METS container according to ENMAP (Europeana Newspapers METS/ALTO Profile).

ALTO format
- OCR data is recorded in ALTO files.
- ALTO was introduced by the University of Innsbruck within the METADATA ENGINE project (FP5, 2000-2003) and promoted by CCS GmbH, Hamburg/Germany.
- It is now hosted by the Library of Congress as an addition to the Metadata Encoding and Transmission Standard (METS).
- National and university libraries from all over the world (USA, Australia, Singapore, Norway, Finland, the British Library, the Bibliothèque nationale de France, the Austrian National Library, etc.) use this format.
- Industry support: the ABBYY SDK has provided native ALTO output since 2011.
- Since 2012 it has also been officially supported/requested by the German Research Foundation (DFG) in its digitisation guidelines.
- The format has some issues (a discussion with the board is ongoing), but it is a de-facto standard and hard to replace (the only alternative is the Google format, with more files but also many more concerns).
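Since ALTO stores both the text and its coordinates, pulling words with their positions out of an ALTO file is straightforward. A minimal sketch, assuming the ALTO v2 namespace and a file name chosen for illustration:

```python
# Minimal sketch: extract word text and coordinates from an ALTO file.
# Assumes the ALTO v2 namespace; other versions use a different URI.
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"

def words_with_coords(alto_path):
    """Yield (text, hpos, vpos, width, height) for every String element."""
    root = ET.parse(alto_path).getroot()
    for string in root.iter(f"{{{ALTO_NS}}}String"):
        # Units are pixels or tenths of mm, depending on MeasurementUnit.
        yield (string.get("CONTENT"),
               float(string.get("HPOS")), float(string.get("VPOS")),
               float(string.get("WIDTH")), float(string.get("HEIGHT")))

for word, x, y, w, h in words_with_coords("alto_page_0001.xml"):  # hypothetical file
    print(f"{word!r} at ({x:.0f},{y:.0f}), size {w:.0f}x{h:.0f}")
```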

Structural data: OCR is just the beginning!
- IMPACT project: development of a general-purpose rule set to extract typical features from books, e.g. headings, running titles, page numbers, footnotes, table-of-contents pages, TOC entries, etc.
- All of this is done on top of the OCR results (stored as ALTO files).
- Results are encouraging: some features can be detected with rather good accuracy (e.g. running titles, page numbers); others are weaker but still rather helpful (e.g. headings, footnotes). A toy example of such a rule follows below.
- For specific document types (e.g. journals), higher accuracy rates can be reached due to their homogeneity.
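To illustrate what such a rule can look like, here is a toy heuristic, not the IMPACT rule set (all names are hypothetical): a page's topmost line is flagged as a running title when nearly the same text sits at the top of a neighbouring page.

```python
# Toy running-title detector over OCR'd pages. Each page is a list of
# (text, vpos) line tuples, e.g. as extracted from ALTO. Illustration only.
from difflib import SequenceMatcher

def top_line(page):
    """Return the text of the topmost line on the page (smallest vpos)."""
    return min(page, key=lambda line: line[1])[0] if page else ""

def running_titles(pages, threshold=0.8):
    """Return one bool per page: True if its top line looks like a running title."""
    flagged = []
    for i, page in enumerate(pages):
        current = top_line(page)
        neighbours = [top_line(p) for p in pages[max(0, i - 1):i + 2]
                      if p is not page]
        ratio = max((SequenceMatcher(None, current, n).ratio()
                     for n in neighbours), default=0.0)
        flagged.append(ratio >= threshold)
    return flagged
```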

Crowd-sourced correction
- National Library of Australia, newspaper digitisation programme: a simple editor to correct text at line level.
- Reaction: extremely positive; tens of thousands of articles (partly) corrected.
- Result: not a solution for correcting all historical Australian newspapers, but certainly an improvement, and something that many more libraries will offer in the future.
- Consequence: the situation becomes even more complex!

Conclusion: a complex situation
- A given document does not have one single level of accuracy; it may come with very different results at feature level (e.g. text, layout, structure, ...), and some parts may even have been corrected by human beings.
- How do we cope with this situation?

Proposed solution
- The situation of automatically produced, noisy text data is relatively new. BUT for researchers (especially historians) this is not a new situation!
- Sources ALWAYS had to be assessed/verified in a critical way: this is exactly what we need to do for mass sources!

Ground truth: prerequisites for automated assessment
- The evaluation of results in technical projects is often done via so-called "ground truth" or a "gold standard".
- These data represent the expected results of a given process.
- The difference between target and actual results makes it possible to quantify improvements, or to compare the results of different research groups (see the sketch below).
- ICDAR conference: several competitions, e.g. on historical newspaper structure recognition, handwritten text recognition, or the linking of table-of-contents entries with headings and page numbers, etc.
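One common way to score "target vs. actual" on the text level is the character error rate (CER): the edit distance between ground truth and OCR output, divided by the length of the ground truth. A self-contained sketch:

```python
# Character error rate (CER) of an OCR result against ground truth.
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(ground_truth: str, ocr_output: str) -> float:
    return levenshtein(ground_truth, ocr_output) / max(len(ground_truth), 1)

print(cer("Digitisation", "Digitisatlon"))  # one substitution -> ~0.083
```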

Automated assessment
- ALETHEIA tool and PAGE format: important work from the PRImA group (Apostolos Antonacopoulos, Stefan Pletschacher) at the University of Salford.
- PAGE format: a rich XML format for describing the layout features of printed text (see the sketch below).
- ALETHEIA: a GUI tool for the manual production of correct text with layout features.
- Evaluation scenarios: evaluation is a complex task; many dimensions and aspects have to be taken into account.
- The intended later usage strongly determines what needs to be evaluated to get meaningful results. E.g. the evaluation of the quality of a full text depends on what is to be done with it: does the reading order play a role, do structural elements play a role, etc.?
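As an illustration of how PAGE ground truth can be consumed programmatically, a minimal sketch that lists region types and polygon outlines (the 2013-07-15 schema namespace is assumed; other PAGE versions use different URIs, and the file name is hypothetical):

```python
# Sketch: read region types and polygon outlines from a PAGE XML file.
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

root = ET.parse("ground_truth_page.xml").getroot()  # hypothetical file
for region in root.iter(f"{{{NS['pc']}}}TextRegion"):
    coords = region.find("pc:Coords", NS)
    points = coords.get("points") if coords is not None else "(no outline)"
    print(region.get("id"), region.get("type"), points)
```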

Assessed feature list (example)

Feature         Assessed     Selection                    Result (F-Measure)
Running text    1000         Randomly selected pages      92 (87-95)
Paragraphs      5000         Weighted random selection    67 (50-80)
Column titles   5 per book   Random selection             85 (81-89)
Footnotes       ...          ...                          ...

Encoding
- METS provides exactly the structure needed to encode the feature list plus the assessment results: a container format with several predefined sections (descriptive, administrative, structural data).
- Administrative metadata section: comparable to the technical metadata for image files; the information is linked to the appropriate sections in the files via ALTO coordinates (block, line, string). A hypothetical sketch of such a link follows below.
- Assessment schema for the textual and structural features of automatically processed documents: a description of the features; conventions for how many items of a feature need to be assessed to get meaningful results; assessment methods (e.g. algorithms for the calculation).
- CLARIN? TEI files can be produced from METS/ALTO; their usability for CLARIN-based processes will depend on the assessment results.
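The sketch below illustrates the idea of recording an assessment result and pointing it at the ALTO blocks it covers. All element and attribute names here are invented for illustration; the real profile (ENMAP) defines its own schema.

```python
# Hypothetical encoding sketch: an assessment result linked to ALTO blocks.
# Element/attribute names are invented; ENMAP defines the actual schema.
import xml.etree.ElementTree as ET

assessment = ET.Element("assessment", feature="running_text",
                        method="F-measure against manual ground truth")
result = ET.SubElement(assessment, "result", fMeasure="92",
                       confidenceInterval="87-95", sampleSize="1000")
# Point at the assessed regions via file and ALTO block IDs:
for block_id in ("P1_TB00001", "P1_TB00002"):  # hypothetical IDs
    ET.SubElement(result, "area", FILEID="ALTO_0001", BLOCKID=block_id)

print(ET.tostring(assessment, encoding="unicode"))
```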

Pilot project
- EU Newspaper Project: the PRImA group is a partner; the evaluation of texts is foreseen in the Description of Work (DoW).
- The ENMAP schema is currently being developed by the University of Innsbruck.
- Pilot: in 2014 we should be able to come up with a first attempt.

Thank you for your attention! Günter Mühlberger <guenter.muehlberger@uibk.ac.at>