The IMP digital library of Slovene written cultural heritage


Tomaž Erjavec
Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
SEEDI 2013

Overview
1. Background
2. Scope of the library
3. Encoding and presentation
4. IMP corpus and lexicon
5. Conclusions

Background
1. AHLib project (2004-08): Deutsch-slowenische/kroatische Übersetzung 1848-1918
   AAS / KFU (prof. Erich Prunč) + JSI
   Slovene books translated from German
2. EU IP IMPACT: Improving Access to Text
   NUK (Alenka Kavčič Čolić, Ines Vodopivec) + JSI
   Slovene publications (books and newspapers)
3. Google award: Computational models for historical Slovene
   ZRC SAZU (Matija Ogrin) + JSI
   Samples of very old books
   Funding for Wikisource (Miran Hladnik)

Goals
Project goals:
   AHLib: develop a corpus to study translation processes
   IMPACT: develop a hand-corrected corpus for improving OCR, and a lexicon giving modern equivalents of historical words for improving IR
   Google: develop computational models for historical Slovene, for better language technologies
So the digital library is actually a side effect.
IMP DL axioms:
   facsimiles + proof-read(!) texts
   uniformly encoded (TEI P5)
   no problems with further use or dissemination

IMP DL
Table (values not preserved): Source | Units | Pages | Words, with rows AHLib, KRN, NUK, WIKI, ZRC
Charts: Units / source (AHLib, KRN, NUK, WIKI, ZRC); Words / year

Annotation on each unit
Meta-data (teiHeader):
   id, responsibility, extent, availability
   basic bibliographic information (two titles: original, modern)
   taxonomy: medium (manuscript, book, magazine, newspaper), text type (fiction, non-fiction, religious), text status (original, translated)
   tag usage, revision description
Facsimile: images in several sizes, each page break linked to its facsimile
Text structure: divisions, headings, lists, tables, notes, poems, figures, line breaks
Editorial interventions: sic/corr, foreign

TEI P5 encoding: source description
   <sourceDesc>
     <bibl>
       <title type="main">Zlata Vas</title>
       <title type="alt">Zlata Vas</title>
       <author>Zschokke, Heinrich</author>
       <respStmt>
         <resp xml:lang="sl">prevajalec</resp>
         <resp xml:lang="en">translator</resp>
         <name>Malavašič, Fran</name>
       </respStmt>
       <date>1850</date>
       <publisher>natisnil Jožef Blaznik</publisher>
       <pubPlace>v Ljubljani</pubPlace>
       <extent>109+3</extent>
       <idno>ds54925 NUK - Narodna in univerzitetna knjižnica</idno>
       <note type="tradok" xml:lang="de">
         <ref target=" <lb/>1850<lb/>4260<lb/>[Zschokke, Heinrich] (Autor, erschlossen) Malavašič, Fran
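As an illustration of how such a source description can be consumed, the following is a minimal sketch that reads a cut-down `<bibl>` with Python's standard `xml.etree.ElementTree`. The fragment is a hypothetical, well-formed reduction of the slide's example: the truncated `<note>` is left out and no TEI namespace is declared, whereas real TEI P5 files normally declare one and would need namespace-qualified paths. The function name `bibl_record` and the shape of the returned dictionary are assumptions made for this example, not part of the IMP tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed reduction of the slide's <bibl> (no TEI namespace,
# truncated <note> omitted).
BIBL = """
<bibl>
  <title type="main">Zlata Vas</title>
  <title type="alt">Zlata Vas</title>
  <author>Zschokke, Heinrich</author>
  <respStmt>
    <resp xml:lang="sl">prevajalec</resp>
    <resp xml:lang="en">translator</resp>
    <name>Malavašič, Fran</name>
  </respStmt>
  <date>1850</date>
  <extent>109+3</extent>
</bibl>
"""

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def bibl_record(xml_text):
    """Return basic bibliographic fields of a <bibl> as a plain dictionary."""
    bibl = ET.fromstring(xml_text)
    record = {
        "title": bibl.findtext("title[@type='main']"),
        "alt_title": bibl.findtext("title[@type='alt']"),
        "author": bibl.findtext("author"),
        "date": bibl.findtext("date"),
        "extent": bibl.findtext("extent"),
        "contributors": [],
    }
    for stmt in bibl.findall("respStmt"):
        # Map language code -> role label, then keep the English label.
        roles = {resp.get(XML_LANG): resp.text for resp in stmt.findall("resp")}
        record["contributors"].append((roles.get("en"), stmt.findtext("name")))
    return record


if __name__ == "__main__":
    print(bibl_record(BIBL))
```

Keeping the role label and the name together as a pair mirrors the way `<respStmt>` groups `<resp>` and `<name>`.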

TEI P5 encoding: text body
   <pb n="[3]" facs="#fpg " xml:id="pb.003"/>
   <div type="level1" xml:id="div.2">
     <head xml:id="head.2">1. <lb/>Kako Ožbalt iz vojske domú pride in kaj ljudjé govorijo.</head>
     <figure xml:id="figure.3">
       <figDesc>Ornamentna sličica. Okrašena črka V.</figDesc>
     </figure>
     <p xml:id="p.7">V nedéljo po poldne je bilo in v Zlati Vasi so mlajši fantini in dekleta pod staro lipo sedéli in peli, ali pa se smejali, kadar jo je kdó iz pivnice prilomil, ki je pregloboko v kozarček polukal. Nekteri kmetje s svojimi ženami so pa v gostivnici sedéli in pri bokalu prav židane volje bili, kakor je že to navada, kadar sta vino in vol po ceni.</p>
     <p xml:id="p.8">Kar jo primaha nék neznan človek v vas. Terdne in velike postave je bil in <choice> <sic>kakil</sic> <corr>kakih</corr> </choice> tridesét lét je mogel iméti; obléčen je bil v sivi suknji, na strani je imel veliko sabljo, na herbtu pa
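The `<choice>` element with `<sic>` and `<corr>` lets one transcription yield either the printed reading or the editor's correction. The sketch below shows one way this could be done with the standard library; it is an illustration under assumptions, not the project's stylesheet logic. The paragraph is a shortened, closed version of the slide's `p.8`, no TEI namespace is declared, and the function name `render` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the slide's paragraph (no TEI namespace).
P = (
    '<p xml:id="p.8">Terdne in velike postave je bil in '
    "<choice> <sic>kakil</sic> <corr>kakih</corr> </choice>"
    " tridesét lét je mogel iméti;</p>"
)


def render(elem, prefer):
    """Flatten elem to plain text.

    Inside every <choice>, only the child named by `prefer` ('sic' or 'corr')
    is kept; whitespace between the alternatives is ignored.
    """
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            chosen = child.find(prefer)
            if chosen is not None:
                parts.append(render(chosen, prefer))
        else:
            parts.append(render(child, prefer))
        # Text following the child belongs to the parent's running text.
        parts.append(child.tail or "")
    return "".join(parts)


if __name__ == "__main__":
    paragraph = ET.fromstring(P)
    print(render(paragraph, "sic"))   # reading as printed
    print(render(paragraph, "corr"))  # corrected reading
```

Passing the preferred reading as a parameter keeps a single traversal for both the diplomatic and the corrected view.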

DL on the Web
We use (slightly modified) TEI XSLT stylesheets to convert TEI to HTML.
Each unit is one HTML file, showing both the facsimile and the (typeset) transcription.
Indexes to books are by taxonomy, sorted by (one of): author, title, date, signature.
One index also shows title pages.

[Figures: Example book: front; Example book: body; Example index (two slides)]

DL as corpus
We have also developed a tool to automatically:
1. Tokenise the text (split it into words, punctuation and whitespace)
2. Modernise the words in the text
3. Tag the words with morphosyntactic descriptions
4. Lemmatise the words
Example TEI encoding: Ako se ne združimo tiga[tega] kužniga[kužnega]...
   <s>
     <w lemma="ako" ana="cs">Ako</w><c> </c>
     <w lemma="se" ana="px------y">se</w><c> </c>
     <w lemma="ne" ana="q">ne</w><c> </c>
     <w lemma="združiti" ana="vmer1p">združimo</w><c> </c>
     <choice>
       <orig><w>tiga</w></orig>
       <reg><w lemma="ta" ana="pd-msg">tega</w></reg>
     </choice>
     <c> </c>
     <choice>
       <orig><w>kužniga</w></orig>
       <reg type="pattern" n="[ega@ iga@]"><w lemma="kužen" ana="agpmsg">kužnega</w></reg>
     </choice>
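To show how the annotated encoding can be read back, here is a minimal sketch that turns one `<s>` into rows of original form, modernised form, lemma and morphosyntactic tag. It assumes a closed `<s>` without a TEI namespace and exactly one `<w>` inside each `<orig>` and `<reg>`; the function name `tokens` and the row layout are hypothetical and not taken from the IMP tool.

```python
import xml.etree.ElementTree as ET

# Closed, namespace-free version of the slide's example sentence.
S = """<s>
<w lemma="ako" ana="cs">Ako</w><c> </c>
<w lemma="se" ana="px------y">se</w><c> </c>
<w lemma="ne" ana="q">ne</w><c> </c>
<w lemma="združiti" ana="vmer1p">združimo</w><c> </c>
<choice><orig><w>tiga</w></orig><reg><w lemma="ta" ana="pd-msg">tega</w></reg></choice><c> </c>
<choice><orig><w>kužniga</w></orig><reg type="pattern" n="[ega@ iga@]"><w lemma="kužen" ana="agpmsg">kužnega</w></reg></choice>
</s>"""


def tokens(sentence):
    """Return one dict per word: original form, modern form, lemma, MSD."""
    rows = []
    for node in sentence:
        if node.tag == "w":
            # Word needing no modernisation: original and modern form coincide.
            rows.append({"orig": node.text, "modern": node.text,
                         "lemma": node.get("lemma"), "msd": node.get("ana")})
        elif node.tag == "choice":
            # Assumption: one <w> on each side of the choice.
            orig = node.find("orig/w")
            reg = node.find("reg/w")
            rows.append({"orig": orig.text, "modern": reg.text,
                         "lemma": reg.get("lemma"), "msd": reg.get("ana")})
        # <c> (whitespace) elements are skipped.
    return rows


if __name__ == "__main__":
    for row in tokens(ET.fromstring(S)):
        print(row)
```

In this encoding the lemma and tag sit on the regularised word, so the sketch reads them from `<reg>` and takes only the surface form from `<orig>`.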

Concordancers
The linguistically analysed corpus is made available on the web via two concordancers:
   NoSketch Engine: the open-source version of the popular (and commercial) Sketch Engine
   CUWI: our front-end to the well-known IMS CWB corpus workbench
The concordancers offer:
   powerful search query syntax (regular expressions over words and annotations)
   filters over meta-data (text types, year of publication, author, ...)
   various sorting options over concordances
   construction of frequency lexica
   collocations
   saving results
   etc.
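The idea behind a regular-expression query over words or annotations, and behind a frequency lexicon, can be sketched on a toy in-memory corpus. This is only an illustration in plain Python: it does not use the query language or API of either concordancer, and the tuple layout and the function names `concordance` and `frequency_lexicon` are assumptions made for this example.

```python
import re
from collections import Counter

# Toy corpus: one (word, lemma, msd) tuple per token, taken from the
# modernised example sentence.
CORPUS = [
    ("ako", "ako", "cs"),
    ("se", "se", "px------y"),
    ("ne", "ne", "q"),
    ("združimo", "združiti", "vmer1p"),
    ("tega", "ta", "pd-msg"),
    ("kužnega", "kužen", "agpmsg"),
]

ATTRIBUTES = {"word": 0, "lemma": 1, "msd": 2}


def concordance(corpus, attr, pattern, width=2):
    """Return (left context, hit, right context) for tokens whose `attr` fully matches `pattern`."""
    idx = ATTRIBUTES[attr]
    rx = re.compile(pattern)
    hits = []
    for i, tok in enumerate(corpus):
        if rx.fullmatch(tok[idx]):
            left = " ".join(t[0] for t in corpus[max(0, i - width):i])
            right = " ".join(t[0] for t in corpus[i + 1:i + 1 + width])
            hits.append((left, tok[0], right))
    return hits


def frequency_lexicon(corpus, attr="lemma"):
    """Count how often each value of `attr` occurs in the corpus."""
    idx = ATTRIBUTES[attr]
    return Counter(tok[idx] for tok in corpus)


if __name__ == "__main__":
    # All tokens whose tag starts with "p" (the pronouns in this toy corpus).
    for left, hit, right in concordance(CORPUS, "msd", r"p.*"):
        print(f"{left:>15} [{hit}] {right}")
    print(frequency_lexicon(CORPUS).most_common(3))
```

Matching on the annotation rather than the word form is what makes the linguistic analysis useful for search: one pattern over tags retrieves every token of a class, whatever its spelling.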

[Figures: NoSketch Engine; CUWI; More than just concordances]

Other IMP language resources
1. goo300k gold-standard corpus
   size in words and pages (figures not preserved)
   page-sampled from IMP DL
   manually annotated
2. IMP lexicon
   made on the basis of hand-annotated corpus examples
   also encoded in TEI P5
   for browsing on the Web (HTML)
   as a data source for HLT applications
3. ToTrTaLe
   the tool to linguistically analyse historical Slovene texts
   TEI P5 I/O
   utilises the IMP lexicon and transcription rules

[Figure: IMP lexicon on the Web]

Size of IMP lexicon
Table (values not preserved): Size: what we count | lemmas | modern forms | historical forms
Rows:
   XL: everything
   L: words
   M: historical forms
   S: archaic words
   XS: word boundaries

Conclusions
Presented the IMP DL, corpus and lexicon of historical Slovene, available at [...]:
   for teaching / investigating Slovene history
   for diachronic linguistic investigations
   for development of HLT

Further work
Currently: final clean-up of data
Next: re-train ToTrTaLe, re-annotate the corpus
Offer more output formats: epub, PDF
Extend beyond 1918 (Wikisource)
HLT experiments: transcription rules, MT, ...
