Workshop: Automatisierte Handschriftenerkennung

Similar documents
Interactive Handwritten Text Recognition and Indexing of Historical Documents: the transcriptorum Project

The Functional Extension Parser (FEP) A Document Understanding Platform

How to use TRANSKRIBUS a very first manual

D2.2: Data Maintenance (M24) Including D2.1. Data Collection and Ground Truth Annotation (M9)

TEI, METS and ALTO, why we need all of them. Günter Mühlberger University of Innsbruck Digitisation and Digital Preservation

D2.1: Data Collection and Ground Truth Annotation

Ground-Truth Production in the transcriptorium Project

Natural Language Inspired Approach for Handwritten Text Line Detection in Legacy Documents

ICFHR 2014 COMPETITION ON HANDWRITTEN KEYWORD SPOTTING (H-KWS 2014)

Handwritten Text Recognition

Research partnerships, user participation, extended outreach some of ETH Library s steps beyond digitization

Post Digitization: Challenges in Managing a Dynamic Dataset. Jasper Faase, 12 April 2012

COLLABORATIVE EUROPEAN DIGITAL ARCHIVE INFRASTRUCTURE

EUROPEAN COMISSION INFORMATION SOCIETY AND MEDIA DIRECTORATE-GENERAL. Information and Communication Technologies. Collaborative Project

Transkribus Transcription Conventions

Google indexed 3,3 billion of pages. Google s index contains 8,1 billion of websites

Enhanced retrieval using semantic technologies:

The Mormon Diaries Project

Two interrelated objectives of the ARIADNE project, are the. Training for Innovation: Data and Multimedia Visualization

Overview of ImageCLEF Mauricio Villegas (on behalf of all organisers)

Contents. Resumen. List of Acronyms. List of Mathematical Symbols. List of Figures. List of Tables. I Introduction 1

A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling For Handwriting Recognition

Data Exchange and Conversion Utilities and Tools (DExT)

A System to Automatically Index Genealogical Microfilm Titleboards Introduction Preprocessing Method Identification

[February 1, 2017] Neural Networks for Dummies Rolf van Gelder, Eindhoven, NL

anyocr: An Open-Source OCR System for Historical Archives

BHL-EUROPE: Biodiversity Heritage Library for Europe. Jana Hoffmann, Henning Scholz

COMMIUS Project Newsletter COMMIUS COMMUNITY-BASED INTEROPERABILITY UTILITY FOR SMES

Segmentation of Kannada Handwritten Characters and Recognition Using Twelve Directional Feature Extraction Techniques

Music Document Layout Analysis through Machine Learning and Human Feedback

Optical Character Recognition Applied to Romanian Printed Texts of the 18th 20th Century

Developing a Research Data Policy

D6.1.2: Second report on scientific evaluations

Encoding and Designing for the Swift Poems Project

Introduction

Micro Focus Partner Program. For Resellers

OUR VISION To be a global leader of computing research in identified areas that will bring positive impact to the lives of citizens and society.

Automatic Speech Recognition (ASR)

HPC IN EUROPE. Organisation of public HPC resources

Bringing the Voices of Communities Together:

DjVu Technology Primer

International Journal of Advance Research in Engineering, Science & Technology

Automatic Reader. Multi Lingual OCR System.

PERIODIC REPORT 3 KYOTO, ICT version April 2012

T H E D I G I TA L L I B R A R Y

Document downloaded from: This paper must be cited as:

PROCESSING AND CATALOGUING DATA AND DOCUMENTATION: QUALITATIVE

GEOFLUENT VS. STRAIGHT MACHINE TRANSLATION: UNDERSTANDING THE DIFFERENCES

Medici for Digital Cultural Heritage Libraries. George Tsouloupas, PhD The LinkSCEEM Project

ARKive-ERA Project Lessons and Thoughts

Visual Concept Detection and Linked Open Data at the TIB AV- Portal. Felix Saurbier, Matthias Springstein Hamburg, November 6 SWIB 2017

LECTURE 6 TEXT PROCESSING

[ PARADIGM SCIENTIFIC SEARCH ] A POWERFUL SOLUTION for Enterprise-Wide Scientific Information Access

The Sunshine State Digital Network

Security Information & Event Management (SIEM)

Evaluation in Quaero. Edouard Geoffrois, DGA Quaero Technology Evaluation Manager. Quaero/imageCLEF workshop Aarhus, Denmark Sept 16 th, 2008

Tools for Data Management. Research Data Management : Session 3 9 th June 2015

International Implementation of Digital Library Software/Platforms 2009 ASIS&T Annual Meeting Vancouver, November 2009

ProQuest Dissertations and Theses Overview. Austin McLean and Marlene Coles CGS Summer Workshop, July 2017

Digitally Preserving African Heritage

Safeguarding Digital Heritage through Sustained Use of Legacy Software

EUDAT Training 2 nd EUDAT Conference, Rome October 28 th Introduction, Vision and Architecture. Giuseppe Fiameni CINECA Rob Baxter EPCC EUDAT members

PDF/A - The Basics. From the Understanding PDF White Papers PDF Tools AG

Universal Access Tip Sheet: Creating Accessible PDF Files from a Scanned Document

Maryland OneStop Statewide License Portal State of Maryland Department of Information Technology

State of the Art and Trends in Search Engine Technology. Gerhard Weikum

CREATIVE. web design

Patent Image Retrieval

RINGTAIL A COMPETITIVE ADVANTAGE FOR LAW FIRMS. Award-winning visual e-discovery software delivers faster insights and superior case strategies.

PROCESSING AND CATALOGUING DATA AND DOCUMENTATION - QUALITATIVE

German Research Strategy in the Area of Civil Security Research

Historical Big Data: Reconstructing the Past through Integrated Analysis of Historical Data

Text Mining for Historical Documents Digitisation and Preservation of Digital Data

Data Curation Profile Human Genomics

The MUSING Approach for Combining XBRL and Semantic Web Data. ~ Position Paper ~

Automated Visualization Support for Linked Research Data

OnLine Handwriting Recognition

Expressions of Interest

From Handwriting Recognition to Ontologie-Based Information Extraction of Handwritten Notes

ABBYY FineReader 14. User s Guide ABBYY Production LLC. All rights reserved.

SEMANTIC WEB POWERED PORTAL INFRASTRUCTURE

Content Enrichment. An essential strategic capability for every publisher. Enriched content. Delivered.

DIGITAL STEWARDSHIP SUPPLEMENTARY INFORMATION FORM

Title: Interactive data entry and validation tool: A collaboration between librarians and researchers

Natural Language Inspired Approach for Handwritten Text Line Detection in Legacy Documents

Website Optimizer. Before we start building a website, it s good practice to think about the purpose, your target

Explicit fuzzy modeling of shapes and positioning for handwritten Chinese character recognition

DoConference Web Conferencing: DoMore DoConference

TIB AV-Portal. Margret Plank 19th of January 2015 TACC Meeting

Basic Requirements for Research Infrastructures in Europe

arxiv: v1 [cs.dl] 22 Nov 2017

Handwritten Gurumukhi Character Recognition by using Recurrent Neural Network

Hello everyone, how are you enjoying the conference so far? Excellent!

SEPTEMBER 2018

Search Engines. Information Retrieval in Practice

Viterbi Based Alignment between Text Images and their Transcripts

Rethinking Semantic Interoperability through Collaboration

Sparta Systems TrackWise Digital Solution

For Attribution: Developing Data Attribution and Citation Practices and Standards

I data set della ricerca ed il progetto EUDAT

Transcription:

Workshop: Automatisierte Handschriftenerkennung Joan Andreu Sánchez Pattern Recognition and Human Language Research group (Technical University of Valencia) Günter Mühlberger, Sebastian Colutto, Philip Kahle Digitisation and Digital Preservation group (University of Innsbruck)

Agenda Part 1: Handwritten Text Recognition (HTR) Part 2: Introduction to TRANSKRIBUS (Transcription and Recognition Platform) Part 3: Introduction to the expert GUI of TRANSKRIBUS Part 4: Hands-on-Session Discussion 02/03/2015 2

Interactive Handwritten Text Recognition and Indexing of Historical Documents: the transcriptorum Project V. Romero, A.H. Toselli, M. Villegas, J.A. Sánchez, E. Vidal {vromero,ahector,mvillegas,jandreu,evidal}@prhlt.upv.es Presenter: Joan Andreu Sánchez Pattern Recognition and Human Language Technology Reseach Center Universitat Politècnica de València (Spain) February 2015 DHd-Tagung 2015

Index 1 Handwritten Text Recognition (HTR) and Indexing 1 2 The transcriptorium Project 3 3 Selected Handwritting Datasets 5 4 HTR and Interactive-Predictive HTR 10 5 Interactive HTR: Transcription Demonstration 12 6 HTR and Interactive-Predictive HTR Results 14 7 Handwritten Text Images Indexing: Search Demonstration 16 8 Handwritten Text Images Indexing and Search Results 19 9 Conclusion 22 J.A. Sánchez PRHLT/UPV Page 1

Index 1 Handwritten Text Recognition (HTR) and Indexing 1 2 The transcriptorium Project 3 3 Selected Handwritting Datasets 5 4 HTR and Interactive-Predictive HTR 10 5 Interactive HTR: Transcription Demonstration 12 6 HTR and Interactive-Predictive HTR Results 14 7 Handwritten Text Images Indexing: Search Demonstration 16 8 Handwritten Text Images Indexing and Search Results 19 9 Conclusion 22 J.A. Sánchez PRHLT/UPV Page 1

Handwritten Text Recognition (HTR) and Indexing Huge amounts of handwritten historical documents are being published by on-line digital libraries world wide However, for these raw digital images to be really useful, they need be annotated with informative content This presentation introduces efficient solutions for the indexing, search and full transcription of historical handwritten document images J.A. Sánchez PRHLT/UPV Page 2

Index 1 Handwritten Text Recognition (HTR) and Indexing 1 2 The transcriptorium Project 3 3 Selected Handwritting Datasets 5 4 HTR and Interactive-Predictive HTR 10 5 Interactive HTR: Transcription Demonstration 12 6 HTR and Interactive-Predictive HTR Results 14 7 Handwritten Text Images Indexing: Search Demonstration 16 8 Handwritten Text Images Indexing and Search Results 19 9 Conclusion 22 J.A. Sánchez PRHLT/UPV Page 3

The transcriptorium Project http://www.transcriptorium.eu STREP of the FP7 in the ICT for Learning and Access to Cultural Resources challenge (1 January 2013 to 31 December 2015) transcriptorium aims to develop innovative, efficient and cost-effective solutions for the indexing, search and full transcription of historical handwritten document images, using modern, holistic Handwritten Text Recognition technology 1. Enhancing HTR technology for efficient transcription 2. Bringing the HTR technology to users 3. Integrating the HTR results in public web portals Supported by: EU Cultural Heritage: J.A. Sánchez PRHLT/UPV Page 4

Index 1 Handwritten Text Recognition (HTR) and Indexing 1 2 The transcriptorium Project 3 3 Selected Handwritting Datasets 5 4 HTR and Interactive-Predictive HTR 10 5 Interactive HTR: Transcription Demonstration 12 6 HTR and Interactive-Predictive HTR Results 14 7 Handwritten Text Images Indexing: Search Demonstration 16 8 Handwritten Text Images Indexing and Search Results 19 9 Conclusion 22 J.A. Sánchez PRHLT/UPV Page 5

Selected Handwritting Datasets BENTHAM: XVIII/XIX centuries collection of over 4, 000 pages of drafts and notes, written by several hands in English PLANTAS: XVII century botanical specimen manuscript collection of seven volumes written by a single hand in Old Spanish kindly provided by the BNE HATTEM: XV century Medieval Manuscript composed of 573 sheets written by a single hand in Dutch ESPOSALLES: XVII century Marriage License records written by several hands in old Catalan and other languages AUSTEN: XVIII century Juvenilia manuscripts by Jane Austen (single hand in English) kindly provided by the BL REICHSGERICHT: early XX century manuscripts of court decisions written by a several hands in German J.A. Sánchez PRHLT/UPV Page 6

BENTHAM Dataset XVIII century collection of over 4, 000 sheets of drafts and notes, written by several writers in English Experiments on a first batch of 433 pre-selected page images Number of: Total Pages 433 Lines 11 473 Running words 106 905 Lexicon size 9 717 Running characters 550 674 Character set size 86 J.A. Sánchez PRHLT/UPV Page 7

AUSTEN Dataset Jane Austen s Juvenilia: XVIII century single hand manuscript Experiments on Volume The Third Number of: Total Pages 128 Lines 2 693 Running words 25 291 Dataset lexicon 3 567 Running characters 118 881 Character set size 81 J.A. Sánchez PRHLT/UPV Page 8

DHd-Tagung 2015 REICHSGERICHT Dataset Court decisions from the German High Court from 1900-1914. Experiments on a first batch of 114 pre-selected page images Number of: Pages Lines Running words Dataset lexicon Running characters Character set size J.A. Sa nchez PRHLT/UPV Total 114 4 046 40 566 6 098 251 813 92 Page 9

Index 1 Handwritten Text Recognition (HTR) and Indexing 1 2 The transcriptorium Project 3 3 Selected Handwritting Datasets 5 4 HTR and Interactive-Predictive HTR 10 5 Interactive HTR: Transcription Demonstration 12 6 HTR and Interactive-Predictive HTR Results 14 7 Handwritten Text Images Indexing: Search Demonstration 16 8 Handwritten Text Images Indexing and Search Results 19 9 Conclusion 22 J.A. Sánchez PRHLT/UPV Page 10

HTR and Interactive-Predictive HTR HTR current state-of-the-art: Segmentation-free approach: no explicit segmentation of text images into words or characters is required The basic input unit is a handwritten text line image Statistical modeling at different perception levels: Optical (character shape), using Hidden Markov Models (HMMs) Lexical, by means of finite-state character representation of words Syntactical, based on statistical language models, such as N-grams Interactive-predictive framework: rather than full transcription automation, the system assists the human transcriber Combines HTR efficiency with the accuracy of human experts, leading to cost-effective perfect transcripts J.A. Sánchez PRHLT/UPV Page 11

HTR Architecture Text Line Image Preprocessing PREPROCESSING AND FEATURE EXTRACTION Character HMMs Word Models p(x w) Image Feature Extraction HANDWRITTEN TEXT DECODING Transcript [ home is locale next ] P (w) = ŵ = x N GRAM Lenguage Model Feature Extraction x = x 1, x 2,..., x n, x i R D Decoding ŵ = arg max w = arg max w p(w x) p(x w) P (w) Huge sets of N-best hypotheses can be arranged into a Word Graph or Lattice. J.A. Sánchez PRHLT/UPV Page 11

Index 1 Handwritten Text Recognition (HTR) and Indexing 1 2 The transcriptorium Project 3 3 Selected Handwritting Datasets 5 4 HTR and Interactive-Predictive HTR 10 5 Interactive HTR: Transcription Demonstration 12 6 HTR and Interactive-Predictive HTR Results 14 7 Handwritten Text Images Indexing: Search Demonstration 16 8 Handwritten Text Images Indexing and Search Results 19 9 Conclusion 22 J.A. Sánchez PRHLT/UPV Page 12

Interactive HTR: Transcription Demonstration It is just a demo! not intended for real operation (other systems do that) Everything is real. No tricks to make demo look better than real Web client-server architecture: Web browser front-end, back-end server providing off-line HTR-CATTI Off-line HTR-CATTI decoder based on word graphs Three tasks: BENTHAM: 78K words open vocabulary AUSTEN: 78K words external, open vocabulary from Bentham texts 20K words external, open vocabulary from Austen texts REICHSGERICHT: 6K words open vocabulary J.A. Sánchez PRHLT/UPV Page 13

Index 1 Handwritten Text Recognition (HTR) and Indexing 1 2 The transcriptorium Project 3 3 Selected Handwritting Datasets 5 4 HTR and Interactive-Predictive HTR 10 5 Interactive HTR: Transcription Demonstration 12 6 HTR and Interactive-Predictive HTR Results 14 7 Handwritten Text Images Indexing: Search Demonstration 16 8 Handwritten Text Images Indexing and Search Results 19 9 Conclusion 22 J.A. Sánchez PRHLT/UPV Page 14

HTR and Interactive-Predictive HTR Results BENTHAM: Training: OMs with 400 pages, LM Lex. 78K words. Test: 33 pages. WER = 22.0 % WSR = 17.2 % EFR: 21.5 % wrt post-edit CER = 9.9 % AUSTEN: No training; just using Bentham models WER = 45.0 % WSR: 27.5 % EFR: 38.9 % wrt post-editing CER = 25.5 % AUSTEN: Training: OMs with 50 pages, LM Lexicon 20K words. Test: 78 pages WER = 32.2 % WSR = 21.4 % EFR: 33.5 % wrt post-editing CER = 15.9 % REICHSGERITCH: Training: OMs with 88 pages, LM Lex. 6K words. Test: 26 pag. WER = 33.3 % WSR: 25.1 % EFR: 24.6 % wrt post-editing CER = 14.5 % WER/CER: percentage of mis-recognized words/characters. Experiments with open-vocabulary lexica and bi-gram LMs. WSR = Percentage of word-level corrections to achieve ground truth transcripts. EFR = Estimated Effort Reduction. J.A. Sánchez PRHLT/UPV Page 15

Index 1 Handwritten Text Recognition (HTR) and Indexing 1 2 The transcriptorium Project 3 3 Selected Handwritting Datasets 5 4 HTR and Interactive-Predictive HTR 10 5 Interactive HTR: Transcription Demonstration 12 6 HTR and Interactive-Predictive HTR Results 14 7 Handwritten Text Images Indexing: Search Demonstration 16 8 Handwritten Text Images Indexing and Search Results 19 9 Conclusion 22 J.A. Sánchez PRHLT/UPV Page 16

Handwritten Text Images Indexing and Search There are massive text image collections out there, but their textual content remains practically inaccessible If perfect or sufficiently accurate text image transcripts were available, image textual context could be strightforwardly indexed for plaintext textual access. But fully automatic transcription results lack the level of accuracy needed for useful text indexing and search purposes And manual or even interactive-predictive assited transcription is entierely prohibitive to deal with massive image collections Good news: indexing and search can be directly implemented on the images themselves, without explicitly resorting to any image transcripts, as we will see now. J.A. Sánchez PRHLT/UPV Page 17

Handwritten Text Images Indexing and Search: Demonstration It is just a demo! not (yet) intended for real operation. But everything is real no tricks to make demo look better than real Line-level indexing according to the precision-recall trade-off model: Rather than exact searching, search is carried out with a confidence threshold, specified by the user as part of the query in order to meet the required precision-recall trade-off Word confidence scores are based on pixel-level probabilities and computed for line-shaped regions. Spotted word positions are marked only approximately Two tasks: AUSTEN: Trained on Austen (50p), 20K words open vocabulary. Demo on the whole Juvenile volume The Third (128 pages) PLANTAS: Trained on Plantas (224p), 21K words open vocabulary. Demo on Volume I (about 1 000 pages) J.A. Sánchez PRHLT/UPV Page 18

Index 1 Handwritten Text Recognition (HTR) and Indexing 1 2 The transcriptorium Project 3 3 Selected Handwritting Datasets 5 4 HTR and Interactive-Predictive HTR 10 5 Interactive HTR: Transcription Demonstration 12 6 HTR and Interactive-Predictive HTR Results 14 7 Handwritten Text Images Indexing: Search Demonstration 16 8 Handwritten Text Images Indexing and Search Results 19 9 Conclusion 22 J.A. Sánchez PRHLT/UPV Page 19

Indexing and Search for Handwritten Text Images: Pixel-level Posteriorgram P X Pixel-level posterior probabilities P for a text image X and word v = matter. An accurate, contextual (n-gram based) word classifier was used to compute P. This helped to achieve very low posteriors in a region of X around (i = 100, j = 200), where a very similar word, matters, is written. J.A. Sánchez PRHLT/UPV Page 20

Results on transcriptorium Data Sets 1 Average Precision (AP) Mean Average Precision (MAP) and Recall-Precision curves Datasets training and test details Precision: π 0.8 0.6 0.4 0.2 0 Bentham Plantas Austen-B Austen Bentham: ap=0.88, map=0.94 Plantas: ap=0.89, map=0.92 Austen-B: ap=0.76, map=0.65 Austen: ap=0.91, map=0.77 0 0.2 0.4 0.6 0.8 1 Recall: ρ BENTHAM: Multi-hand. Training: 400 pg. from Bentham, 87 char. HMMs, 2-gram LM trained on Bentham texts; Lexicon 9 341 tokens. Test: 33 pages; query set: 6 962 keywords AUSTEN-B: Single hand. No training; using Bentham char. HMMs, lexicon and LM. Test: 78 pages; query set: 9 000 keywords AUSTEN: Single hand. Training: 50 Austen pages, 81 char. HMMs, 2-gram LM trained on Austen texts; Lexicon 20K tokens. Test: 78 pages; query set: 9 000 keywords J.A. Sánchez PRHLT/UPV Page 21

Index 1 Handwritten Text Recognition (HTR) and Indexing 1 2 The transcriptorium Project 3 3 Selected Handwritting Datasets 5 4 HTR and Interactive-Predictive HTR 10 5 Interactive HTR: Transcription Demonstration 12 6 HTR and Interactive-Predictive HTR Results 14 7 Handwritten Text Images Indexing: Search Demonstration 16 8 Handwritten Text Images Indexing and Search Results 19 9 Conclusion 22 J.A. Sánchez PRHLT/UPV Page 22

Conclussions and Future Work Conclusions Automatic or assited handwritten text transcription and fully automatic indexing is now becoming perfectly feassible Models trained for a given collection can provide quite useful performance on images from other similar collections, without need of (re-training) Several demonstrators have been implemented and made publicly available for first-hand experience in real use; see: http://transcriptorium.eu/demonstrations On-going and future work Research to overcome the line-detection bottelneck Indexing and search experiments with massive handwritting document collections (thousands to millions page images) J.A. Sánchez PRHLT/UPV Page 23

Part 2 Introduction to the Transcription and Recognition Platform - TRANSKRIBUS 02/03/2015 3

TRANSKRIBUS enables collaboration among humanities scholars, computer scientists, archives and volunteers with the ultimate goal to revolutionise recognition, transcription and access to historical handwritten documents. 02/03/2015 4

ARCHIVES & LIBRARIES HUMANITIES SCHOLARS COMPUTER SCIENTISTS PUBLIC USERS

ARCHIVES & LIBRARIES HUMANITIES SCHOLARS TRANSCRIPITON & RECOGNITON PLATFORM COMPUTER SCIENTISTS PUBLIC USERS

ARCHIVES & LIBRARIES Provide images Receive standard formats (METS/ALTO/TEI/PDF-A) DOCUMENTS HUMANITIES SCHOLARS TRP COMPUTER SCIENTISTS PUBLIC USERS

ARCHIVES & LIBRARIES Provide images Receive standard formats (METS/ALTO/TEI/PDF-A) Transcribe text DOCUMENTS HUMANITIES SCHOLARS Receive TEI, PDF EXPERT & CROWD CLIENTS TRP COMPUTER SCIENTISTS PUBLIC USERS

ARCHIVES & LIBRARIES Provide images Receive standard formats (METS/ALTO/TEI/PDF-A) Transcribe text DOCUMENTS Work with reference data HUMANITIES SCHOLARS Receive TEI, PDF EXPERT & CROWD CLIENTS TRP HTR DIA KWS NLP, HPC Invent algorithms COMPUTER SCIENTISTS PUBLIC USERS

ARCHIVES & LIBRARIES Provide images Receive standard formats (METS/ALTO/TEI/PDF-A) Transcribe text DOCUMENTS Work with reference data HUMANITIES SCHOLARS Receive TEI, PDF EXPERT & CROWD CLIENTS TRP HTR DIA KWS NLP, HPC Improve algorithms COMPUTER SCIENTISTS WEBSITE Annotate, evaluate, contribute Search and retrieve handwritten documents PUBLIC USERS

ARCHIVES & LIBRARIES Provide images Receive standard formats (METS/ALTO/TEI/PDF-A) HUMANITIES SCHOLARS Transcribe text Receive TEI, PDF TRANSKRIBUS TRANSCRIPITON RECOGNITION PLATFORM Work with reference data Improve algorithms COMPUTER SCIENTISTS Annotate, evaluate, contribute Search and retrieve handwritten documents PUBLIC USERS

Cycle of growth HTR RECOGNITION ACCELERATED DIGITISATION NEW SERVICES MORE TRANSCRIPTS MORE USERS 02/03/2015 12

Architecture Integrated Tools External Services Images Document Management User Management TRP Expert GUI Crowd- Sourcing GUI METS ALTO PDF TEI RTF PAGE Website Search Interface 02/03/2015 13

Archives/collection holders are enabled to Manage the transcription process of large and heterogeneous document collections in a standardized and effective way Expose their digitised documents to their own employees, humanities scholars, volunteers and the crowd Involve users according to their background Support humanities scholars in their research Outsource transcription (or the training of an HTR model) to service providers (e.g. off-shore) Make large amounts of handwritten documents full-text searchable (if a trained HTR model is available) Import transcribed documents in machine readable formats (METS/ALTO) Provide standardized requirements to researchers and technology providers dealing with issues of interest (e.g. manage tendering) 02/03/2015 14

Humanities scholars are enabled to (1) Use TRANSKRIBUS for free (registration) Upload as many single documents as they want Transcribe their documents semi-automated in a way that image and text are linked to each other (improved transcription quality!) Enrich their documents with Named Entities (person and geo names, dates) and personal tags Normalize their transcription in a transparent way (abbreviations, character sets, editorial declaration) Export their document in various formats TEI: transcribed text with machine and human readable text ( scientific style ) PDF-Text: transcribed text with some formatting ( book style ) PDF-Image: transcribed text in the background, image in the foreground ( Scanned PDF style ) RTF: transcribed text for human readability ( working style ) METS/ALTO: full machine readable package ( digital library style ) METS/PAGE: full machine readable package ( computer science style ) 02/03/2015 15

Humanities scholars are enabled to (2) Manage their documents in a private collection (no one else has access!) Invite other users to collaborate (e.g. students, colleagues) and to work on the same document See different versions of their document and compare them to each other Make their documents available to every registered user (=crowd-sourcing scenario) Expose their documents also in a simplified web-based transcription GUI (TSX) Transcribe documents with the support of HTR if a model is available (needs offline training in beforehand) Exploit language resource database in the background (e.g. words, variants, frequencies, etc.) Search within a handwritten text collection Benefit from other TRANSKRIBUS users in an indirect way. The following resources will be available to every user of the platform: Trained HTR models Normalized editorial declarations Normalized Named Entities Language resources 02/03/2015 16

Computer scientists are enabled to Benefit from ongoing transcription work in the platform Access large amounts of highly valuable data (transcribed text with segmentation) in standardized format (PAGE) Download documents (if they are public domain, if the document owner has agreed, or if just small random sets are used) Investigate new methods, algorithms and tools on the basis of real world data Have a tool available with which humanities scholars and archives can express their requirements in a highly standardized way Use several web-services for integrating their tools into the platform Expose their research results/tools/methods within the platform and increase their reputation among the several communities (humanities, public, archives) Compare their results with other research groups (competitions) on the basis of a common document set Manage the generation of ground truth (reference data) in an effective way Collect and gain feedback from different user groups 02/03/2015 17

Volunteers and crowd-users are enabled to Work with the same tools and in the same way as professional researchers Make a valuable contribution by enriching archival collections (crowd-sourcing) Take part in citizen science projects (e.g. produce Ground Truth ) Benefit from easy-to-use web-based interfaces for low-barrier participation (TSX) Upload their own family documents Benefit from state-of-the-art technology in Computer Vision, Document Layout Analysis, HTR, OCR, etc. Expose their documents to experts, service providers, etc. Search in the full-text of handwritten documents 02/03/2015 18

Business model Free usage Single documents (who ever wants to work with the platform) E.g. humanities scholars, volunteers, genealogists, archivists, Service Level Agreement Document collections E.g. transcription projects (grants), archives, libraries (OCR collections) Yearly fee depends on the amount of documents and services Public-Public-Partnership Subsidiary model, not only for TRANSKRIBUS itself, but also for participants (e.g. archives, collection holders) E.g. DARIAH, direct support via e-infrastructure funds, research agencies (may have interest on standardized formats and workflows, etc.) Vision Hundreds of humanities scholars, thousands of volunteers, dozens of archives, 02/03/2015 19

Next steps Collect feedback Platform and GUIs are in a usable form, but many important features are still not included First real user tests start with this workshop Spread the idea Several workshops are planned in UK, Austria, Germany, Netherlands, etc. Attract as many humanities scholars and volunteers as possible to try out the platform and to integrate it into their research Address archives and collection holders and convince them to enter service level (pre-)agreement with University of Innsbruck / TRANSKRIBUS How to contribute? Try out the tools and give us feedback! Upload your documents and transcribe 100 pages for training the HTR Invite us for workshops, presentations, webinars Spread the idea! 02/03/2015 20

Thank you for your attention! 02/03/2015 21