How To Transcribe Documents with Transkribus Simple Mode


This is a short introduction to the basic steps for transcribing documents with Transkribus. The platform is specifically designed to enable users to generate highly standardized output. There are various options for transcribing documents, and transcripts can also be used to train Handwritten Text Recognition software.

For further information see the following papers and websites:
- How To Transcribe Documents with Transkribus Advanced Mode
- How To Prepare Test Projects with Transkribus for Archives and Libraries

Download the Transkribus Expert Client, or make sure you are using the latest version:
https://transkribus.eu/

Consult the Transkribus Wiki for further information and a Users Guide:
https://transkribus.eu/wiki/
https://transkribus.eu/wiki/index.php/users_guide

Transkribus and the technology behind it are made available via the following projects and sites:
https://read.transkribus.eu/
https://transcriptorium.eu/
https://github.com/transkribus/

Contact
The Transkribus Team: email@transkribus.eu

Contents

Introduction
  Getting started
  Benefits
  General rules
  Learning by doing
  Upload Example Package to Transkribus
Segmentation
  Introduction
  Viewing modes
  Step 1: Define text regions
  Step 2: Define lines/baselines
  Tables
Transcription
  Introduction
  Transcribe text
  Text mark up
  Additions
Credits

The READ project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 674943.

Introduction

Getting started

Everyone can easily learn how to transcribe historical documents with Transkribus.

- Download Transkribus from the following website: https://transkribus.eu/
- Further instructions on installing the platform can be found in the Transkribus Wiki.
- Once Transkribus has been installed, open the platform and click the Login button on the Main Menu.
- Log in using the email address and password you used when registering your account.

Figure 1: Login button

Detailed background information can be found in the Transkribus Wiki and the User's Guide. Though Transkribus is an expert programme, you will quickly become familiar with the basic functionality needed to upload, segment, transcribe and export documents.

Transkribus is still in development, so you may discover some bugs or features that can be improved. Do not hesitate to use the Bug Report and Feature Request button in Transkribus. We are grateful for every kind of feedback!

Figure 2: Bug Report and Feature Request button

Benefits

The main benefits of Transkribus are twofold.

First, the data can serve as training data for the Handwritten Text Recognition (HTR) engines which are also part of the Transkribus platform. The HTR engines can learn a specific type of script once they have access to a few dozen pages of correctly transcribed training data. Let's assume you have a collection of 100, 1000 or 10,000 letters and you would like to process them automatically. Automatic processing means that you not only receive an automatically transcribed text but can also search throughout your document collection. Transcripts generated with Transkribus are a first step to achieving this outcome.

Second, the transcriptions in Transkribus can be used as a basis for a scholarly edition of a document or document collection. The transcription can be exported at any time as an XML, TEI (Text Encoding Initiative), PDF or Word file. Moreover, the documents can also be made available via web services so that a seamless connection can be generated between several platforms. In the not so distant future, Transkribus will also support online access to transcribed documents.

Another benefit is that correct transcriptions will in the future also serve as learning material for students or volunteers who are interested in practising the correct transcription of historical texts. A specific interface will support this use case as well.

General rules

There are only a few simple rules which you should bear in mind before starting your transcription:

1. Segmentation = connect text and image via the baseline

In Transkribus it is always necessary to connect the transcript and the image. The HTR engines must be able to match each line of the transcript with its corresponding line in the image. To achieve this, each image must be segmented into text regions, lines and baselines. This process is called segmentation and can be done manually or with the support of layout analysis tools which are integrated in Transkribus.

Figure 3: The line in the canvas (yellow line) and the transcribed text in the Text Editor (blue line) always need to be linked

2. Transcription = transcribe what you see

The transcription should follow the graphical appearance of the text (glyphs) and neither add nor omit text. Capital letters should be transcribed as capital letters, special characters as special characters, abbreviations as abbreviations and so on.

Figure 4: Transcribe what you see. E.g. abbreviated words are transcribed as they appear in the document (simple mode) or expanded with the abbreviation tag (advanced mode)

If you follow these two simple rules your transcription will be suitable for all three use cases: (1) training the HTR engines, (2) preparing documents for a scholarly edition and (3) generating learning resources for students or volunteers.

Learning by doing

While reading these instructions you should load the Example Package which we have prepared for you. Follow this link and download the zip file:
https://transkribus.eu/wiki/images/d/d6/example_package.zip

Figure 5: Images from the Example Package

The Example Package consists of the three pages shown above.

- Unzip the zip file.
- You will see a folder called Example_Package which also contains a folder called page. In this folder you will see the XML files where transcripts and related information are stored.
- Use the Open local folder button in Transkribus to open the folder from your computer.

Figure 6: "Open local folder" and load the Transkribus Example Package

Figure 7: Select the folder: Example_Package

The Example Package contains the following documents:

- Page 1: Example page for the simple mode of transcription
  - A typical layout with running text and marginalia
- Page 2 and 3
  - A more sophisticated layout
  - Interline additions
  - Special characters from Latin Extended Character Sets
  - Tagged entities, such as person names, dates, etc.
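For readers who prefer to check the download from a script before opening it in Transkribus, the unzip step can be sketched in Python. This is only an illustration under stated assumptions: the function name `unpack_example_package` and the local paths are hypothetical, and the only facts taken from the text are that the archive unpacks to a folder containing a sub-folder called page with XML files.

```python
import zipfile
from pathlib import Path


def unpack_example_package(zip_path, target_dir):
    """Extract the archive and return the XML files found in any folder named 'page'."""
    target = Path(target_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # According to the guide, transcripts are stored as XML files in a folder called "page".
    return sorted(p for p in target.rglob("*.xml") if p.parent.name == "page")


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the download was saved.
    archive_path = Path("example_package.zip")
    if archive_path.exists():
        for xml_file in unpack_example_package(archive_path, "."):
            print(xml_file)
```

Listing the XML files this way is merely a quick confirmation that the package was extracted completely; opening the folder with the Open local folder button remains the actual workflow.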

Figure 8: Example Package opened as local document

Upload Example Package to Transkribus

In order to run the necessary tools on your documents, they need to reside on the Transkribus server. This means that you need to upload the Example Package to Transkribus. In order to work with your own documents, you will also need to upload them to the Transkribus server.

Note: All collections and documents in Transkribus are private. Only users authorised by you are able to see your documents. They are not made available to the public. Uploading a document to the Transkribus server is therefore a purely technical process.

Uploading documents to the Transkribus server is simple. Click the upload button in the Document tab.

Figure 9: Upload the Example Package or your own image files to your personal collection

Figure 10: Select "Upload single document" for documents up to 500 MB

You have three options:

- Upload via HTTP from a local folder: This is suitable for uploading a few documents which have a combined size of less than 500 MB. We will be using this option in these instructions.
- Upload via FTP: This is suitable if you want to upload several documents, or documents of more than 500 MB.
- Upload via URL of DFG Viewer METS: This allows you to upload documents directly from repositories which support the DFG (Deutsche Forschungsgemeinschaft, German Research Foundation) Viewer.
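The choice between the first two options depends only on the combined size of the files, so it can be checked before uploading. The following Python sketch is an assumption-laden illustration: the function names are hypothetical, 1 MB is taken to be 1024 * 1024 bytes, and a size of exactly 500 MB is routed to FTP because the text does not say which option applies at the boundary.

```python
from pathlib import Path

# Size limit named in the guide for the HTTP upload.
# Assumption: 1 MB is counted as 1024 * 1024 bytes.
HTTP_LIMIT_BYTES = 500 * 1024 * 1024


def folder_size_bytes(folder):
    """Sum the sizes of all regular files below the folder."""
    return sum(p.stat().st_size for p in Path(folder).rglob("*") if p.is_file())


def suggest_upload_option(size_bytes, limit=HTTP_LIMIT_BYTES):
    """Return 'http' for documents under the limit, otherwise 'ftp'."""
    return "http" if size_bytes < limit else "ftp"


if __name__ == "__main__":
    # Hypothetical folder name; replace with the folder you intend to upload.
    folder = Path("Example_Package")
    if folder.is_dir():
        size = folder_size_bytes(folder)
        print(size, "bytes ->", suggest_upload_option(size))
```

The sketch does not talk to the Transkribus server at all; it only reports which of the two upload options described above fits the folder.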

Note: It is not currently possible to upload images as single PDF files. Before uploading to Transkribus, you should first extract the image files from the PDF files. You can do this with specific software, e.g. Adobe Acrobat Professional.

To upload the Example Package:

- Click on Ingest or upload documents.
- Select Upload single document.
- Use the Local folder section to find the Example Package on your computer.
- Select an already available collection from the drop down menu, or create your own collection.
- Write the name of the collection you want to create into the Create collection field, here: guenters_collection
- Press the green + button.
- Select the new collection from the drop down menu above and click Upload.

Figure 11: Create your own collection by writing the title (here: guenters_collection) into the field and press the green + button. Then select the new collection from the drop down menu above

Uploading may take several minutes depending on your Internet connection.

Segmentation

Introduction

For the HTR to work, the text and image need to be connected in Transkribus. This is achieved by segmenting each document into:

- Text regions (TR): The text region must contain all the relevant text which shall be transcribed.
- Lines (L): The line region is here for technical reasons and does not play a role for the end user.
- Baselines (B): The baselines are very important. They need to be correct because they are the basis for both training the HTR and applying HTR models (i.e. recognition).

These segmented regions are known as elements. The process of dividing a page into these elements is called segmentation or layout analysis. Segmentation can be done manually or performed automatically by Transkribus.

Figure 12: The green rectangle indicates the text region. The text region needs to be correct

Figure 13: The blue polygon represents the line region. It is NOT necessary to correct the line region

Figure 14: The red polyline indicates the baseline. The baseline needs to be correct

Segmentation elements in Transkribus have the following features:

- Segmentation elements in Transkribus can be either rectangles or polygons. The default mode is to use rectangles but you can easily switch to using polygonal elements. The baseline is the only segmentation element which consists of just a polyline (i.e. a line with several points).
- Segmentation elements in Transkribus can overlap with each other. In handwritten documents it is often the case that the writing does not follow strict rules, e.g. marginalia and running text are often not clearly separated.
- Segmentation elements in Transkribus follow a hierarchical order: A baseline needs to be part of a line region, a line region needs to be part of a text region. E.g. if you add a baseline without having defined a text region beforehand, Transkribus will ask you if it should also generate the missing parent element.

Nevertheless we have made it simple for you to work with this hierarchy: First, you need to define (or correct) the text regions. Second, you need to define (or correct) the baselines. That's all that needs to be done. A single page can be completed within a few minutes, or even quicker!

Viewing modes

Before starting to try out the features in Transkribus you should be familiar with the Viewing modes which are offered in the platform. We have prepared two Viewing modes for you, one for the Segmentation task, one for the Transcription task. You can also configure and store your preferred viewing mode in Transkribus.

You can select the Viewing modes from the Main Menu. They are called Segmentation View and Transcription View.

Figure 15: Viewing modes for segmentation and transcription tasks

If you select the Segmentation View:

- The Text Editor field will disappear.
- The lines of text regions and baselines will be thick so they are easy to see.
- Text regions will be displayed in green, baselines in red. Line regions will not be displayed.
- The rectangular mode will be turned on, i.e. text regions will be rectangles by default.
- The points defining a line or a rectangle will be large so that they can be moved easily in order to change the shape of each segmentation element.

Figure 16: Segmentation View of the example page

If you select the Transcription View:

- The Text Editor field will be displayed.
- The lines of the segmentation elements will be thin and the points defining these elements will be small.
- The colouring of the baseline will change from red to a faint yellow.
- This should make it easier to read the text in the document image.

Figure 17: Transcription View of the same line

Step 1: Define text regions

- Select the Segmentation View from the Main Menu.
- Select the Add a text region button.

Figure 18: Add a text region with the +TR button

- Click on the top left corner of a block of text and then click on the bottom right corner.
- Text regions should represent coherent parts of the text; they can contain several paragraphs.
- The order in which you define the text regions will also be the order in which they are shown in the Structure Tab. You can edit the order with the Reading order button in the Main Menu.

Note: The text region should be close to the actual lines of the text.

Note: Decorative characters or initials do not need to be included in the text region.

Note: Currently it is faster to define the text blocks manually, especially if a high level of accuracy is necessary.

Figure 19: Text regions manually added (rectangles)

Text which will not appear in the transcription or which will not be used as training data for the HTR engine can be left out. This means that you do not mark it as a text region, nor do you mark it with lines/baselines.

Step 2: Define lines/baselines

- Stay in the Segmentation View mode.
- Select the Tools Tab in Transkribus.
- Run the Detect lines and baselines tool (second from above) from the Tools Tab.

Figure 20: Line/baselines automatically generated with the "Detect lines and baselines" tool

- Review and correct the results of the line/baseline segmentation.
- The baseline (the thick red line at the bottom of the red polyrectangle) should be close to the actual characters. The characters should sit on the baseline exactly in the way you learned it as a pupil in primary school ;)
- To correct the baseline, click and drag the dots on the baseline.

Note: It is sufficient to review/edit/correct the baseline. Line regions do not need to be corrected.

Note: The line/baseline segmentation tool sometimes produces long baselines going far beyond the actual text. Such baselines should be corrected. In such cases you may select Remove point from selected polygon.

Note: If you discover errors, it is often easier to delete the baseline and redraw it. To do this, select the line region and press the Delete key on your keyboard. Both the line region and the baseline will be deleted. When you redraw a baseline, Transkribus will automatically generate a corresponding (parent) line region.

- To draw a baseline, click the +BL button.
- To create a straight line, click at the start of the line of text, move your mouse along the line and double click to finish.
- To create a crooked line, click at the start of the line of text, move your mouse along, click again to change angle, continue to move along and double click to finish.
- To undo any manual segmentation, press the green backwards arrow button.
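The parent/child relationship between text regions, line regions and baselines can be pictured as a small data model. The Python sketch below is purely illustrative and is not how Transkribus stores its data: all class and method names are hypothetical, and unlike Transkribus, which asks before generating a missing text region, the sketch simply creates any missing parent. It mirrors two behaviours described above: adding a baseline generates its parent line region, and deleting a line region removes its baseline with it.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[int, int]  # (x, y) position in the page image, in pixels


@dataclass
class Baseline:
    # Polyline: two points for a straight line, more for a crooked one.
    points: List[Point]


@dataclass
class LineRegion:
    baseline: Optional[Baseline] = None


@dataclass
class TextRegion:
    lines: List[LineRegion] = field(default_factory=list)


@dataclass
class Page:
    regions: List[TextRegion] = field(default_factory=list)

    def add_baseline(self, points, region=None):
        """Add a baseline and create the parent elements it needs."""
        if len(points) < 2:
            raise ValueError("a baseline needs at least two points")
        if region is None:
            # No text region given: generate the missing parent element.
            region = TextRegion()
            self.regions.append(region)
        line = LineRegion(baseline=Baseline(list(points)))
        region.lines.append(line)
        return line

    def delete_line(self, line):
        """Remove a line region; the baseline it contains goes with it."""
        for region in self.regions:
            for index, candidate in enumerate(region.lines):
                if candidate is line:
                    del region.lines[index]
                    return True
        return False


if __name__ == "__main__":
    page = Page()
    # A crooked baseline: start point, one change of angle, end point.
    drawn = page.add_baseline([(40, 200), (300, 206), (620, 201)])
    print(len(page.regions), "text region,", len(page.regions[0].lines), "line")
    page.delete_line(drawn)
```

Keeping the baseline inside the line region is a design choice of this sketch that makes the "delete both at once" behaviour fall out naturally.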

Figure 21: Erroneous line/baseline from the automated detection

Figure 22: Corrected line/baseline (deleted and manually added with +BL button)

Tables

Tables can also be handled in the simple mode if you just want to train the HTR engine or create learning resources. Just draw text regions across the table itself or across rows or columns and segment the baselines in the way described above.

Note: Currently the automatic layout analysis does not produce usable results for tables. In the course of the READ project we will develop a Table Recognition Tool where users will be able to edit tables in a more convenient way. We will provide a prototype of such an editor at the end of 2016.

Transcription

Introduction

The main purpose of any transcription is to capture all the information available in a document. Transkribus supports UTF-8 and stores all characters in Unicode. A correct diplomatic transcription is the basis for this. Nevertheless there is also hidden information, such as emphasized words (underlined, bold), notes which were added at a later time, or abbreviations which need to be expanded in order to understand the content of the document. All this can be marked as well.

Transcribe text

- Select the Transcription View from the Main Menu.
- You will see the Text Editor field: For each line/baseline in the image you will find a corresponding line in the Text Editor. The image and the text are connected in this way.
- Transcribe the text according to the language of your source document. Use the characters of your keyboard.

Text mark up

Typical markup of text can be found in the Metadata Tab. There you can select from a range of markup settings:

- Bold
- Underlined
- Strike through
- Superscript
- Text colour
- Etc.

Figure 23: For markup, select the Metadata Tab and select from the options in the Text style field

Most of this markup is directly displayed in the Text Editor field.
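Because special characters are to be transcribed as special characters, it can help to see which Unicode characters a transcribed line actually contains. The Python sketch below uses only the standard library; the function name and the sample line are made up for illustration and are not taken from the Example Package.

```python
import unicodedata


def describe_special_characters(line):
    """List each distinct non-ASCII character with its code point and Unicode name."""
    seen = []
    for char in line:
        if ord(char) > 127 and char not in seen:
            seen.append(char)
    return [(c, f"U+{ord(c):04X}", unicodedata.name(c, "UNKNOWN")) for c in seen]


if __name__ == "__main__":
    # Made-up sample line containing the long s, a character found in many historical scripts.
    sample = "Wiſſenſchaft"
    for char, code_point, name in describe_special_characters(sample):
        # The last column shows the bytes used when the character is saved as UTF-8.
        print(char, code_point, name, char.encode("utf-8"))
```

A check like this makes it easy to spot a look-alike character that was typed by mistake, for example a plain "f" in place of a long s.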

Hyphenated words at the end of the line should be indicated with.

Additions

Additions, especially interline additions, need not be handled in a specific way in the simple mode. You should just transcribe exactly what you see.

Note: If you export the transcription to a Word or TEI file, the reading order of your document may be incorrect. For training the HTR engine this does not make a difference.

Credits

We would like to thank the many users who have contributed their feedback to help improve the Transkribus software.

Transkribus is made available to the public as part of the H2020 e-Infrastructure Project READ (Recognition and Enrichment of Archival Documents), which received funding from the European Commission under grant agreement No 674943.