How To Transcribe Documents with Transkribus
Simple Mode

This is a short introduction to the basic steps for transcribing documents with Transkribus. This platform is specifically designed to enable users to generate highly standardized output. There are various options for transcribing documents, and transcripts can also be used to train Handwritten Text Recognition software.

For further information see the following papers and websites:
- How To Transcribe Documents with Transkribus Advanced Mode
- How To Prepare Test Projects with Transkribus for Archives and Libraries

Download the Transkribus Expert Client, or make sure you are using the latest version:
https://transkribus.eu/

Consult the Transkribus Wiki for further information and a User's Guide:
https://transkribus.eu/wiki/
https://transkribus.eu/wiki/index.php/users_guide

Transkribus and the technology behind it are made available via the following projects and sites:
https://read.transkribus.eu/
https://transcriptorium.eu/
https://github.com/transkribus/

Contact the Transkribus Team: email@transkribus.eu
Contents

Introduction
  Getting started
  Benefits
  General rules
  Learning by doing
  Upload Example Package to Transkribus
Segmentation
  Introduction
  Viewing modes
  Step 1: Define text regions
  Step 2: Define lines/baselines
  Tables
Transcription
  Introduction
  Transcribe text
  Text mark up
  Additions
Credits

The READ project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 674943.
Introduction

Getting started

Everyone can easily learn how to transcribe historical documents with Transkribus.

- Download Transkribus from the following website: https://transkribus.eu/
- Further instructions on installing the platform can be found in the Transkribus Wiki.
- Once Transkribus has been installed, open the platform and click the Login button on the Main Menu.
- Log in using the email address and password you used when registering your account.

Figure 1 Login button

Detailed background information can be found in the Transkribus Wiki and the User's Guide. Though Transkribus is an expert programme, you will quickly become familiar with the basic functionality needed to upload, segment, transcribe and export documents.

Transkribus is still in development, so you may discover some bugs or features that can be improved. Do not hesitate to use the Bug Report and Feature Request button in Transkribus. We are grateful for every kind of feedback!

Figure 2 Bug Report and Feature Request button

Benefits

The main benefits of Transkribus are twofold.

First, the data can serve as training data for the Handwritten Text Recognition (HTR) engines, which are also part of the Transkribus platform. The HTR engines can learn a specific type of script once they have access to a few dozen pages of correctly transcribed training data. Let's assume you have a collection of 100, 1000 or 10,000 letters and you would like to process them automatically. With a trained engine you will not only receive an automatically transcribed text but also be able to search throughout your document collection. Transcripts generated with Transkribus are a first step to achieving this outcome.

Second, the transcriptions in Transkribus can be used as a basis for a scholarly edition of a document or document collection. The transcription can be exported at any time as an XML, TEI (Text Encoding Initiative), PDF or Word file. Moreover, the documents can also be made available via web services so that a seamless connection can be generated between several platforms.
And in the not so distant future, Transkribus will also support online access to transcribed documents.

Another benefit is that correct transcriptions will in the future also serve as learning material for students or volunteers who are interested in practising the correct transcription of historical texts. A specific interface will support this use case as well.
General rules

There are only a few simple rules which you should bear in mind before starting with your transcription:

1. Segmentation = Connect text and image via the baseline

In Transkribus it is always necessary to connect the transcript and the image. The HTR engines must be able to match each line of the transcript with its corresponding line in the image. To achieve this, each image must be segmented into text regions, lines and baselines. This process is called segmentation and can be done manually or with the support of layout analysis tools which are integrated in Transkribus.

Figure 3 Line in the canvas (yellow line) and transcribed text in the Text Editor (blue line) always need to be linked

2. Transcription = Transcribe what you see

The transcription should follow the graphical appearance of the text (glyphs) and neither add nor omit text. Capital letters should be transcribed as capital letters, special characters as special characters, abbreviations as abbreviations and so on.
Figure 4 Transcribe what you see. E.g. abbreviated words are transcribed as they appear in the document (simple mode) or expanded with the abbreviation tag (advanced mode)

If you follow these two simple rules your transcription will be suitable for all three use cases: (1) training the HTR engines, (2) preparing documents for a scholarly edition and (3) generating learning resources for students or volunteers.

Learning by doing

While reading these instructions you should load the Example Package which we have prepared for you.

- Follow this link and download the zip file: https://transkribus.eu/wiki/images/d/d6/example_package.zip

Figure 5 Images from the Example Package

The Example Package consists of the three pages shown above.

- Unzip the zip file.
- You will see a folder called Example_Package, which also contains a folder called page. In this folder you will see the XML files where transcripts and related information are stored.
- Use the Open local folder button in Transkribus to open the folder from your computer.

Figure 6 "Open local folder" and load the Transkribus Example Package

Figure 7 Select the folder: Example_Package

The Example Package contains the following documents:

Page 1:
- Example page for the simple mode of transcription
- A typical layout with running text and marginalia

Pages 2 and 3:
- A more sophisticated layout
- Interline additions
- Special characters from Latin Extended Character Sets
- Tagged entities, such as person names, dates, etc.
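
The XML files in the page folder can also be inspected outside Transkribus. The following is a minimal Python sketch, assuming the files follow the PAGE XML layout (TextRegion elements containing TextLine elements, each with a Baseline and a TextEquiv/Unicode transcript). The sample document, its ids and text, and the function names are made up for illustration; they are not taken from the Example Package.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, NOT a file from the Example Package. Assumption: the
# files follow the PAGE XML layout (TextRegion > TextLine > Baseline/TextEquiv).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page1.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <Baseline points="100,200 500,205 900,200"/>
        <TextEquiv><Unicode>Dear Sir,</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Baseline points="100,260 900,262"/>
        <TextEquiv><Unicode>I have received your letter</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local(tag):
    """Strip the XML namespace so the sketch does not depend on the schema version."""
    return tag.rsplit("}", 1)[-1]


def read_lines(xml_text):
    """Return (region id, line id, baseline points, transcript) for each text line."""
    root = ET.fromstring(xml_text)
    rows = []
    for region in root.iter():
        if local(region.tag) != "TextRegion":
            continue
        for line in region:
            if local(line.tag) != "TextLine":
                continue
            points, text = [], ""
            for child in line:
                if local(child.tag) == "Baseline":
                    # "x,y x,y ..." becomes a list of (x, y) integer pairs
                    points = [tuple(int(v) for v in pair.split(","))
                              for pair in child.get("points", "").split()]
                elif local(child.tag) == "TextEquiv":
                    for uni in child:
                        if local(uni.tag) == "Unicode":
                            text = uni.text or ""
            rows.append((region.get("id"), line.get("id"), points, text))
    return rows


if __name__ == "__main__":
    for region_id, line_id, points, text in read_lines(SAMPLE):
        print(region_id, line_id, len(points), "baseline points:", text)
```

Each row pairs one baseline in the image with one line of transcript, which is the link between image and text that the first general rule asks for.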
Figure 8 Example Package opened as local document

Upload Example Package to Transkribus

To run the necessary tools on your documents, the documents need to reside on the Transkribus server. This means that you need to upload the Example Package to Transkribus. To work with your own documents, you will also need to upload them to the Transkribus server.

Note: All collections and documents in Transkribus are private. Only users authorised by you are able to see your documents. They are not made available to the public. Uploading a document to the Transkribus server is therefore a purely technical process.

Uploading documents to the Transkribus server is simple. Click the upload button in the Document tab.
Figure 9 Upload the Example Package or your own image files to your personal collection

Figure 10 Select "Upload single document" for documents up to 500 MB

You have three options:

- Upload via HTTP from a local folder: This is suitable for uploading a few documents which have a combined size of less than 500 MB. We will be using this option in these instructions.
- Upload via FTP: This is suitable if you want to upload several documents, or documents of more than 500 MB.
- Upload via URL of DFG Viewer METS: This allows you to upload documents directly from repositories which support the DFG (Deutsche Forschungsgemeinschaft, German Research Foundation) Viewer.
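
The choice between the first two options comes down to the combined size of the files. As a rough Python sketch, assuming the 500 MB limit is counted as 500 * 1024 * 1024 bytes (the guide gives only the rounded figure) and using made-up function names, the check could look like this:

```python
import sys
from pathlib import Path

# Assumption: "500 MB" means 500 * 1024 * 1024 bytes; treat the exact
# threshold as approximate, since the guide only states the rounded value.
HTTP_LIMIT_BYTES = 500 * 1024 * 1024


def folder_size(folder):
    """Sum the sizes of all regular files below `folder`, in bytes."""
    return sum(p.stat().st_size for p in Path(folder).rglob("*") if p.is_file())


def suggest_upload(size_bytes, limit=HTTP_LIMIT_BYTES):
    """Suggest an upload route following the rule of thumb in the text."""
    return "HTTP (local folder)" if size_bytes < limit else "FTP"


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    size = folder_size(target)
    print(f"{target}: {size / (1024 * 1024):.1f} MB -> {suggest_upload(size)}")
```

Run it with the path of the folder you intend to upload, for example the unzipped Example_Package folder.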
Note: It is not currently possible to upload images as single PDF files. Before uploading to Transkribus, you should first extract the image files from the PDF files. You can do this with specific software, e.g. Adobe Acrobat Professional.

To upload the Example Package:

- Click on Ingest or upload documents
- Select Upload single document
- Use the Local folder section to find the Example Package on your computer
- Select an already available collection from the drop-down menu, or create your own collection
- Write the name of the collection you want to create into the Create collection field, here: guenters_collection
- Press the green + button
- Select the new collection from the drop-down menu above and click Upload

Figure 11 Create your own collection by writing the title (here: guenters_collection) into the field and pressing the green + button. Then select the new collection from the drop-down menu above

Uploading may take several minutes depending on your Internet connection.

Segmentation

Introduction

For the HTR to work, the text and image need to be connected in Transkribus. This is achieved by segmenting each document into:

- Text regions (TR): The text region must contain all the relevant text which is to be transcribed.
- Lines (L): The line region exists for technical reasons and does not play a role for the end user.
- Baselines (B): The baselines are very important. They need to be correct because they are the basis for both training the HTR and applying HTR models (i.e. recognition).

These segmented regions are known as elements. The process of dividing a page into these elements is called segmentation or layout analysis. Segmentation can be done manually or performed automatically by Transkribus.

Figure 12 The green rectangle indicates the text region. The text region needs to be correct

Figure 13 The blue polygon represents the line region. It is NOT necessary to correct the line region

Figure 14 The red polyline indicates the baseline. The baseline needs to be correct

Segmentation elements in Transkribus have the following features:

- Segmentation elements can be either rectangles or polygons. The default mode is to use rectangles but you can easily switch to using polygonal elements. The baseline is the only segmentation element which consists of just a polyline (i.e. a line with several points).
- Segmentation elements can overlap with each other. In handwritten documents the writing often does not follow strict rules, e.g. marginalia and running text are often not clearly separated.
- Segmentation elements follow a hierarchical order: a baseline needs to be part of a line region, and a line region needs to be part of a text region. For example, if you add a baseline without having defined a text region beforehand, Transkribus will ask you if it should also generate the missing parent element.

Nevertheless, we have made it simple for you to work with this hierarchy: first, you need to define (or correct) the text regions; second, you need to define (or correct) the baselines. That's all that needs to be done. A single page can be completed within a few minutes, or even quicker!

Viewing modes

Before trying out the features in Transkribus you should be familiar with the Viewing modes offered in the platform. We have prepared two Viewing modes for you, one for the Segmentation task and one for the Transcription task. You can also configure and store your preferred viewing mode in Transkribus. You can select the Viewing modes from the Main Menu. They are called Segmentation View and Transcription View.

Figure 15 Viewing modes for segmentation and transcription tasks

If you select the Segmentation View:

- The Text Editor field will disappear.
- The lines of text regions and baselines will be thick so they are easy to see.
- Text regions will be displayed in green, baselines in red. Line regions will not be displayed.
- The rectangular mode will be turned on, i.e. text regions will be rectangles by default.
- The points defining a line or a rectangle will be large so that they can be moved easily in order to change the shape of each segmentation element.

Figure 16 Segmentation View of the example page

If you select the Transcription View:
- The Text Editor field will be displayed.
- The lines of the segmentation elements will be thin and the points defining these elements will be small.
- The colouring of the baseline will change from red to a faint yellow.

This should make it easier to read the text in the document image.

Figure 17 Transcription View of the same line

Step 1: Define text regions

- Select the Segmentation View from the Main Menu
- Select the Add a text region button

Figure 18 Add a text region with the +TR button

- Click on the top left corner of a block of text and then click on the bottom right corner
- Text regions should represent coherent parts of the text; they can contain several paragraphs
- The order in which you define the text regions will also be the order in which they are shown in the Structure Tab. You can edit the order with the Reading order button in the Main Menu.

Note: The text region should be close to the actual lines of the text.

Note: Decorative characters or initials do not need to be included in the text region.
Note: Currently it is faster to define the text blocks manually, especially if a high level of accuracy is necessary.

Figure 19 Text regions manually added (rectangles)

Text which will not appear in the transcription or which will not be used as training data for the HTR engine can be left out. This means that you neither mark it as a text region nor mark it with lines/baselines.

Step 2: Define lines/baselines

- Stay in the Segmentation View mode
- Select the Tools Tab in Transkribus
- Run the Detect lines and baselines tool (second from the top) from the Tools Tab.
Figure 20 Lines/baselines automatically generated with the "Detect lines and baselines" tool

- Review and correct the results of the line/baseline segmentation. The baseline (the thick red line at the bottom of the red polyrectangle) should be close to the actual characters. The characters should sit on the baseline exactly as you learned when you were a pupil in primary school ;)
- To correct the baseline, click and drag the dots on the baseline

Note: It is sufficient to review/edit/correct the baseline. Line regions do not need to be corrected.

Note: The line/baseline segmentation tool sometimes produces long baselines going far beyond the actual text. Such baselines should be corrected. In such cases you may select Remove point from selected polygon.

Note: If you discover errors, it is often easier to delete the baseline and redraw it. To do this, select the line region and press the Delete key on your keyboard. Both the line region and the baseline will be deleted. When you redraw a baseline, Transkribus will automatically generate a corresponding (parent) line region.

- To draw a baseline, click the +BL button
- To create a straight line, click at the start of the line of text, move your mouse along the line and double-click to finish
- To create a crooked line, click at the start of the line of text, move your mouse along, click again to change angle, continue to move along and double-click to finish
- To undo any manual segmentation, press the green backwards arrow button.
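
The element hierarchy (a baseline belongs to a line region, a line region belongs to a text region) and the automatic creation of a parent line region can be pictured with a small data model. This is an illustrative Python sketch only: the class names, the padding value and the bounding-box rule are assumptions and do not describe how Transkribus computes line regions internally.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) in image pixels, y growing downwards


@dataclass
class Baseline:
    points: List[Point]  # polyline: two points if straight, more if crooked


@dataclass
class LineRegion:
    polygon: List[Point]
    baseline: Baseline


@dataclass
class TextRegion:
    polygon: List[Point]
    lines: List[LineRegion] = field(default_factory=list)

    def add_baseline(self, points: List[Point], pad: int = 40) -> LineRegion:
        """Add a baseline and generate its parent line region.

        Assumption (hypothetical rule): the line region is a box around the
        baseline, extended `pad` pixels upwards to cover the letters on it.
        """
        if len(points) < 2:
            raise ValueError("a baseline needs at least two points")
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        box = [(min(xs), min(ys) - pad), (max(xs), min(ys) - pad),
               (max(xs), max(ys)), (min(xs), max(ys))]
        line = LineRegion(polygon=box, baseline=Baseline(list(points)))
        self.lines.append(line)
        return line


if __name__ == "__main__":
    region = TextRegion(polygon=[(50, 50), (950, 50), (950, 1400), (50, 1400)])
    region.add_baseline([(100, 200), (900, 200)])              # straight line
    region.add_baseline([(100, 260), (500, 266), (900, 260)])  # crooked line
    for line in region.lines:
        print(len(line.baseline.points), "points, line region", line.polygon)
```

The point of the model is that a baseline never exists on its own: adding one always yields a line region inside a text region, which is why only text regions and baselines need manual attention.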
Figure 21 Erroneous line/baseline from the automated detection

Figure 22 Corrected line/baseline (deleted and manually added with +BL button)

Tables

Tables can also be handled in the simple mode if you just want to train the HTR engine or create learning resources. Just draw text regions across the table itself or across rows or columns and segment the baselines in the way described above.

Note: Currently the automatic layout analysis does not produce usable results for tables. In the course of the READ project we will develop a Table Recognition Tool where users will be able to edit tables in a more convenient way. We will provide a prototype of such an editor at the end of 2016.
Transcription

Introduction

The main purpose of any transcription is to capture all the information available in a document. A correct diplomatic transcription is the basis for this. Transkribus supports UTF-8 and stores all characters in Unicode. Nevertheless, there is also hidden information, such as emphasized words (underlined, bold), notes which were added at a later time, or abbreviations which need to be expanded in order to understand the content of the document. All this can be marked as well.

Transcribe text

- Select the Transcription View from the Main Menu
- You will see the Text Editor field: for each line/baseline in the image you will find a corresponding line in the Text Editor. The image and the text are connected in this way.
- Transcribe the text according to the language of your source document. Use the characters of your keyboard.

Text mark up

Typical markup of text can be found in the Metadata Tab. There you can select from a range of markup settings:

- Bold
- Underlined
- Strike through
- Superscript
- Text colour
- Etc.

Figure 23 For markup, select the Metadata Tab and select from the options in the Text style field

Most of this markup is directly displayed in the Text Editor field.
Hyphenated words at the end of the line should be indicated with.

Additions

Additions, especially interline additions, do not need to be handled in a specific way in the simple mode. You should just transcribe exactly what you see.

Note: If you export the transcription to a Word or TEI file, the reading order of your document may be incorrect. For training the HTR engine this does not make a difference.

Credits

We would like to thank the many users who have contributed their feedback to help improve the Transkribus software. Transkribus is made available to the public as part of the H2020 e-Infrastructure Project READ (Recognition and Enrichment of Archival Documents), which received funding from the European Commission under grant agreement No 674943.