The IMP digital library of Slovene written cultural heritage


Tomaž Erjavec
Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
SEEDI 2013

Overview
1. Background
2. Scope of the library
3. Encoding and presentation
4. IMP corpus and lexicon
5. Conclusions

Background
1. AHLib project ( ): Deutsch-slowenische/kroatische Übersetzung , AAS / KFU (prof. Erich Prunč) + JSI
   Slovene books translated from German
2. EU IP IMPACT ( ): Improving Access to Text, NUK (Alenka Kavčič Čolić, Ines Vodopivec) + JSI
   Slovene publications (books and newspapers)
3. Google award ( ): Computational models for historical Slovene, ZRC SAZU (Matija Ogrin) + JSI
   Samples of very old books
   Funding for Wikisource (Miran Hladnik)

Goals
Project goals:
   AHLib: develop a corpus to study translation processes
   IMPACT: develop a hand-corrected corpus for improving OCR, and a lexicon giving modern equivalents of historical words for improving IR
   Google: develop computational models for historical Slovene, for better language technologies
So the digital library is actually a side effect of these projects.
IMP DL axioms:
   facsimiles + proof-read(!) texts
   uniformly encoded (TEI P5)
   no problems with further use or dissemination

IMP DL ( )
[Table: Units, Pages and Words per source (AHLib, KRN, NUK, WIKI, ZRC)]
[Chart: Units / source (AHLib, KRN, NUK, WIKI, ZRC)]
[Chart: Words / year]

Annotation on each unit
Meta-data (teiHeader):
   id, responsibility, extent, availability
   basic bibliographic information (two titles: original, modern)
   taxonomy: medium (manuscript, book, magazine, newspaper), text type (fiction, non-fiction, religious), text status (original, translated)
   tag usage, revision description
Facsimile: images in several sizes, each page break linked to its facsimile
Text structure: divisions, headings, lists, tables, notes, poems, figures, line breaks
Editorial interventions: sic/corr, foreign

TEI P5 encoding: source description
<sourceDesc>
  <bibl>
    <title type="main">Zlata Vas</title>
    <title type="alt">Zlata Vas</title>
    <author>Zschokke, Heinrich</author>
    <respStmt>
      <resp xml:lang="sl">prevajalec</resp>
      <resp xml:lang="en">translator</resp>
      <name>Malavašič, Fran</name>
    </respStmt>
    <date>1850</date>
    <publisher>natisnil Jožef Blaznik</publisher>
    <pubPlace>v Ljubljani</pubPlace>
    <extent>109+3</extent>
    <idno>ds54925 NUK - Narodna in univerzitetna knjižnica</idno>
    <note type="tradok" xml:lang="de">
      <ref target=" <lb/>1850<lb/>4260<lb/>[Zschokke, Heinrich] (Autor, erschlossen) Malavašič, Fran
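
A source description like the one above can be read with a few lines of standard-library Python. The following is a minimal sketch, not the project's own tooling: the fragment is a shortened, well-formed imitation of the slide, the TEI namespace declaration is assumed (the slide does not show it), and the function name read_bibl is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed fragment modelled on the slide. The namespace
# declaration is an assumption; the slide shows no root element.
SOURCE_DESC = """
<sourceDesc xmlns="http://www.tei-c.org/ns/1.0">
  <bibl>
    <title type="main">Zlata Vas</title>
    <author>Zschokke, Heinrich</author>
    <respStmt>
      <resp xml:lang="sl">prevajalec</resp>
      <resp xml:lang="en">translator</resp>
      <name>Malavašič, Fran</name>
    </respStmt>
    <date>1850</date>
    <extent>109+3</extent>
  </bibl>
</sourceDesc>
"""

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}
# ElementTree exposes xml:lang under the fixed XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_bibl(xml_text):
    """Return title, author, year and contributors from a <sourceDesc>."""
    root = ET.fromstring(xml_text)
    bibl = root.find("tei:bibl", TEI)
    record = {
        "title": bibl.findtext("tei:title[@type='main']", namespaces=TEI),
        "author": bibl.findtext("tei:author", namespaces=TEI),
        "year": int(bibl.findtext("tei:date", namespaces=TEI)),
        "contributors": [],
    }
    for stmt in bibl.findall("tei:respStmt", TEI):
        role = None
        # Prefer the English label of the responsibility, if present.
        for resp in stmt.findall("tei:resp", TEI):
            if resp.get(XML_LANG) == "en":
                role = resp.text
        name = stmt.findtext("tei:name", namespaces=TEI)
        record["contributors"].append((role, name))
    return record


record = read_bibl(SOURCE_DESC)
print(record)
```

Records of this shape would be enough to build indexes sorted by author, title or date.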

TEI P5 encoding: text body
<pb n="[3]" facs="#fpg " xml:id="pb.003"/>
<div type="level1" xml:id="div.2">
  <head xml:id="head.2">1. <lb/>Kako Ožbalt iz vojske domú pride in kaj ljudjé govorijo.</head>
  <figure xml:id="figure.3">
    <figDesc>Ornamentna sličica. Okrašena črka V.</figDesc>
  </figure>
  <p xml:id="p.7">V nedéljo po poldne je bilo in v Zlati Vasi so mlajši fantini in dekleta pod staro lipo sedéli in peli, ali pa se smejali, kadar jo je kdó iz pivnice prilomil, ki je pregloboko v kozarček polukal. Nekteri kmetje s svojimi ženami so pa v gostivnici sedéli in pri bokalu prav židane volje bili, kakor je že to navada, kadar sta vino in vol po ceni.</p>
  <p xml:id="p.8">Kar jo primaha nék neznan človek v vas. Terdne in velike postave je bil in
    <choice>
      <sic>kakil</sic>
      <corr>kakih</corr>
    </choice>
    tridesét lét je mogel iméti; obléčen je bil v sivi suknji, na strani je imel veliko sabljo, na herbtu pa
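
Because editorial interventions are kept as sic/corr pairs inside choice, either reading of the text can be recovered mechanically. The sketch below shows one way to do that with the Python standard library; the paragraph is a shortened, well-formed imitation of the slide, namespaces are omitted for brevity, and the helper name reading is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened paragraph modelled on the slide (namespace omitted for brevity).
PARAGRAPH = (
    '<p xml:id="p.8">in <choice><sic>kakil</sic><corr>kakih</corr></choice>'
    " tridesét lét</p>"
)


def reading(elem, prefer="corr"):
    """Return the text of elem, keeping one alternative of each <choice>.

    prefer="corr" gives the corrected text, prefer="sic" the text as printed.
    """
    skip = "sic" if prefer == "corr" else "corr"
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != skip:
            parts.append(reading(child, prefer))
        # Text following a child belongs to the parent in either reading.
        parts.append(child.tail or "")
    return "".join(parts)


paragraph = ET.fromstring(PARAGRAPH)
corrected = reading(paragraph, prefer="corr")
as_printed = reading(paragraph, prefer="sic")
print(corrected)
print(as_printed)
```

Keeping both alternatives in the markup means that the proof-read text and the faithful transcription never have to be stored as separate files.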

DL on the Web
We use (slightly modified) TEI XSLT stylesheets to convert TEI to HTML
Each unit is one HTML file, showing both the facsimile and the (typeset) transcription
Indexes to books are by taxonomy, sorted by (one of):
   author
   title
   date
   signature
One index also shows title pages

[Screenshot: Example book: front]

[Screenshot: Example book: body]

[Screenshot: Example index]

[Screenshot: Example index]

DL as corpus
We have also developed a tool to automatically:
1. Tokenise the text (split it into words, punctuation and whitespace)
2. Modernise the words in the text
3. Tag the words with morphosyntactic descriptions
4. Lemmatise the words
Example TEI encoding of: Ako se ne združimo tiga[tega] kužniga[kužnega]...
<s>
  <w lemma="ako" ana="cs">Ako</w><c> </c>
  <w lemma="se" ana="px------y">se</w><c> </c>
  <w lemma="ne" ana="q">ne</w><c> </c>
  <w lemma="združiti" ana="vmer1p">združimo</w><c> </c>
  <choice>
    <orig><w>tiga</w></orig>
    <reg><w lemma="ta" ana="pd-msg">tega</w></reg>
  </choice>
  <c> </c>
  <choice>
    <orig><w>kužniga</w></orig>
    <reg type="pattern" n="[ega@ iga@]"><w lemma="kužen" ana="agpmsg">kužnega</w></reg>
  </choice>
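
The annotated sentence can be flattened into one row per word, pairing each historical form with its modernised form, lemma and tag. This is a minimal sketch using the Python standard library; the sentence is a well-formed copy of the slide example with a closing tag added, namespaces are omitted, and the function name token_rows is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the slide example (closing </s> added, no namespace).
SENTENCE = """
<s>
<w lemma="ako" ana="cs">Ako</w><c> </c>
<w lemma="se" ana="px------y">se</w><c> </c>
<w lemma="ne" ana="q">ne</w><c> </c>
<w lemma="združiti" ana="vmer1p">združimo</w><c> </c>
<choice><orig><w>tiga</w></orig>
<reg><w lemma="ta" ana="pd-msg">tega</w></reg></choice><c> </c>
<choice><orig><w>kužniga</w></orig>
<reg type="pattern" n="[ega@ iga@]"><w lemma="kužen" ana="agpmsg">kužnega</w></reg></choice>
</s>
"""


def token_rows(sentence):
    """Return (original, modernised, lemma, tag) for every word in <s>."""
    rows = []
    for node in sentence:
        if node.tag == "w":
            # Word needs no modernisation: both forms are identical.
            rows.append((node.text, node.text, node.get("lemma"), node.get("ana")))
        elif node.tag == "choice":
            # Historical form in <orig>, annotated modern form in <reg>.
            orig = node.find("orig/w")
            reg = node.find("reg/w")
            rows.append((orig.text, reg.text, reg.get("lemma"), reg.get("ana")))
    return rows


rows = token_rows(ET.fromstring(SENTENCE))
original_text = " ".join(row[0] for row in rows)
modern_text = " ".join(row[1] for row in rows)
print(original_text)
print(modern_text)
```

Note that in this encoding the lemma and tag are attached to the modernised word, so searches by lemma work regardless of the historical spelling.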

Concordancers
The linguistically analysed corpus is made available on the web via two concordancers:
   NoSketch Engine: the open-source version of the popular (and commercial) Sketch Engine
   CUWI: our front-end to the well-known IMS CWB corpus workbench
The concordancers offer:
   a powerful search query syntax (regular expressions over words and annotations)
   filters over meta-data (text types, year of publication, author, ...)
   various sorting options over concordances
   construction of frequency lexica
   collocations
   saving results
   etc.
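
To illustrate what a concordance with a regular-expression query and a frequency list amounts to, here is a toy keyword-in-context sketch in Python. It is not how NoSketch Engine or CUWI work internally and does not use their query syntax; the function names kwic and frequency_list are hypothetical, and the token list is the example sentence from the previous slide.

```python
import re
from collections import Counter


def kwic(tokens, pattern, width=3):
    """Return (left context, keyword, right context) for matching tokens."""
    rx = re.compile(pattern)
    lines = []
    for i, tok in enumerate(tokens):
        # fullmatch: the pattern must cover the whole word form.
        if rx.fullmatch(tok):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def frequency_list(lines):
    """Count how often each keyword occurs in a concordance."""
    return Counter(keyword for _, keyword, _ in lines)


TOKENS = "Ako se ne združimo tiga kužniga".split()
hits = kwic(TOKENS, r".*iga", width=2)
freq = frequency_list(hits)
for left, keyword, right in hits:
    print(f"{left:>20} [{keyword}] {right}")
```

A real concordancer would run the same kind of query over an index rather than a token list, and would also let the pattern refer to lemmas and morphosyntactic tags.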

[Screenshot: NoSketch Engine]

[Screenshot: CUWI]

[Screenshot: More than just concordances]

Other IMP language resources
1. goo300k gold-standard corpus
   words, pages
   page-sampled from IMP DL
   manually annotated
2. IMP lexicon
   made on the basis of the hand-annotated corpus
   examples
   also encoded in TEI P5
   for browsing on the Web (HTML)
   as a data source for HLT applications
3. ToTrTaLe
   the tool to linguistically analyse historical Slovene texts
   TEI P5 I/O
   utilises the IMP lexicon and transcription rules
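
The combination of a lexicon with transcription rules can be pictured as a lookup with a rule-based fallback. The sketch below is only an illustration of that idea, not ToTrTaLe's implementation: the lexicon entry and the word-final rewrite of "iga" to "ega" are toy data inferred from the slide example, and the rule format, names and return labels are all assumptions.

```python
# Toy data: historical form -> modern form (assumed, from the slide example).
LEXICON = {"tiga": "tega"}

# Hypothetical word-final rewrite rules as (historical suffix, modern suffix).
SUFFIX_RULES = [("iga", "ega")]


def modernise(word, lexicon=LEXICON, rules=SUFFIX_RULES):
    """Return (modern form, method) for a historical word form.

    The lexicon is consulted first; rules are only a fallback for words
    that the lexicon does not cover.
    """
    if word in lexicon:
        return lexicon[word], "lexicon"
    for old, new in rules:
        if word.endswith(old):
            return word[: -len(old)] + new, "pattern"
    return word, "unchanged"


results = {w: modernise(w) for w in ["tiga", "kužniga", "se"]}
for word, (modern, method) in results.items():
    print(word, "->", modern, f"({method})")
```

Recording which method produced a form makes it possible to review rule-generated modernisations separately from attested lexicon entries.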

[Screenshot: IMP lexicon on the Web]

Size of IMP lexicon
[Table: counts of lemmas, modern forms and historical forms for each size]
Size: what we count
   XL: everything
   L: words
   M: historical forms
   S: archaic words
   XS: word boundaries

Conclusions
Presented the IMP DL, corpus, & lexicon of historical Slovene, available at
   for teaching / investigating Slovene history
   for diachronic linguistic investigations
   for development of HLT

Further work
Currently: final clean-up of data
Next: re-train ToTrTaLe, re-annotate the corpus
Offer more output formats: epub, PDF
Extend beyond 1918 (Wikisource)
HLT experiments: transcription rules, MT, ...


More information

Prof. Ahmet Süerdem Istanbul Bilgi University London School of Economics

Prof. Ahmet Süerdem Istanbul Bilgi University London School of Economics Prof. Ahmet Süerdem Istanbul Bilgi University London School of Economics Media Intelligence Business intelligence (BI) Uses data mining techniques and tools for the transformation of raw data into meaningful

More information

XML Metadata Standards and Topic Maps

XML Metadata Standards and Topic Maps XML Metadata Standards and Topic Maps Erik Wilde 16.7.2001 XML Metadata Standards and Topic Maps 1 Outline what is XML? a syntax (not a data model!) what is the data model behind XML? XML Information Set

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval WS 2008/2009 25.11.2008 Information Systems Group Mohammed AbuJarour Contents 2 Basics of Information Retrieval (IR) Foundations: extensible Markup Language (XML)

More information

Effective searching strategies and techniques

Effective searching strategies and techniques Effective searching strategies and techniques Getting the most from electronic information resources Objectives To understand the importance of effective searching To develop guidelines for planning and

More information

Experimental Deployment of a Grid Virtual Organization for Human Language Technologies

Experimental Deployment of a Grid Virtual Organization for Human Language Technologies Experimental Deployment of a Grid Virtual Organization for Human Language Technologies Jan Jona Javoršek, Tomaž Erjavec Jožef Stefan Institute Jamova ulica 39, SI-1000 Ljubljana, Slovenia jan.javorsek@ijs.si,

More information

BUDDHIST STONE SCRIPTURES FROM SHANDONG, CHINA

BUDDHIST STONE SCRIPTURES FROM SHANDONG, CHINA BUDDHIST STONE SCRIPTURES FROM SHANDONG, CHINA Heidelberg Academy of Sciences and Humanities Research Group Buddhist Stone Scriptures in China Hauptstraße 113 69117 Heidelberg Germany marnold@zo.uni-heidelberg.de

More information

Digitizing Historic Newspapers

Digitizing Historic Newspapers Digitizing Historic Newspapers the University of Utah Way Presented by Scott Christensen iarchives, Inc. July 14, 2005 Agenda 3 Keys to a Quality Digitized Product Processing Methodology Q&A 3 Keys - Introduction

More information

Sustainability of Text-Technological Resources

Sustainability of Text-Technological Resources Sustainability of Text-Technological Resources Maik Stührenberg, Michael Beißwenger, Kai-Uwe Kühnberger, Harald Lüngen, Alexander Mehler, Dieter Metzing, Uwe Mönnich Research Group Text-Technological Overview

More information

A Register of Early Modern Slovenian Manuscripts

A Register of Early Modern Slovenian Manuscripts Journal of the Text Encoding Initiative Issue 4 2013 Selected Papers from the 2011 TEI Conference A Register of Early Modern Slovenian Manuscripts Matija Ogrin, Jan Jona Javoršek and Tomaž Erjavec Electronic

More information

UC Irvine Unicode Project

UC Irvine Unicode Project UC Irvine Unicode Project Title A proposal to encode New Testament editorial characters in the UCS Permalink https://escholarship.org/uc/item/6r10d7w1 Authors Pantelia, Maria C. Peevers, Richard Publication

More information

Historical Text Mining:

Historical Text Mining: Historical Text Mining Historical Text Mining, and Historical Text Mining: Challenges and Opportunities Dr. Robert Sanderson Dept. of Computer Science University of Liverpool azaroth@liv.ac.uk http://www.csc.liv.ac.uk/~azaroth/

More information

The Mormon Diaries Project

The Mormon Diaries Project The Mormon Diaries Project Scott Eldredge, Digital Initiatives Program Manager Harold B. Lee Library Frederick Zarndt, CTO iarchives What Is Transcription? Transcribe v.t. 1. To write over again; copy

More information

Introduction to XML. Asst. Prof. Dr. Kanda Runapongsa Saikaew Dept. of Computer Engineering Khon Kaen University

Introduction to XML. Asst. Prof. Dr. Kanda Runapongsa Saikaew Dept. of Computer Engineering Khon Kaen University Introduction to XML Asst. Prof. Dr. Kanda Runapongsa Saikaew Dept. of Computer Engineering Khon Kaen University http://gear.kku.ac.th/~krunapon/xmlws 1 Topics p What is XML? p Why XML? p Where does XML

More information

Chrome based Keyword Visualizer (under sparse text constraint) SANGHO SUH MOONSHIK KANG HOONHEE CHO

Chrome based Keyword Visualizer (under sparse text constraint) SANGHO SUH MOONSHIK KANG HOONHEE CHO Chrome based Keyword Visualizer (under sparse text constraint) SANGHO SUH MOONSHIK KANG HOONHEE CHO INDEX Proposal Recap Implementation Evaluation Future Works Proposal Recap Keyword Visualizer (chrome

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

Scaling Out For Extreme Scale Corpus Data

Scaling Out For Extreme Scale Corpus Data Scaling Out For Extreme Scale Corpus Data Matthew Coole, Paul Rayson and John Mariani School of Computing and Communications Lancaster University Lancaster, Lancashire, UK m.coole@lancaster.ac.uk, p.rayson@lancaster.ac.uk,

More information

CoRoLa Starts Blooming An update on the Reference Corpus of Contemporary Romanian Language

CoRoLa Starts Blooming An update on the Reference Corpus of Contemporary Romanian Language CoRoLa Starts Blooming An update on the Reference Corpus of Contemporary Romanian Language Dan Tufiș, Verginica Barbu Mititelu, Elena Irimia, Ștefan Daniel Dumitrescu, Tiberiu Boroș Research Institute

More information

CONTENTdm & The Digital Collection Gateway New Looks for Discovery and Delivery

CONTENTdm & The Digital Collection Gateway New Looks for Discovery and Delivery CONTENTdm & The Digital Collection Gateway New Looks for Discovery and Delivery EVERY CONNECTION has a starting point. OCLC EMEA Regional Council Meeting Deutsche Nationalbibliothek Frankfurt 2 nd March

More information

ARKive-ERA Project Lessons and Thoughts

ARKive-ERA Project Lessons and Thoughts ARKive-ERA Project Lessons and Thoughts Semantic Web for Scientific and Cultural Organisations Convitto della Calza 17 th June 2003 Paul Shabajee (ILRT, University of Bristol) 1 Contents Context Digitisation

More information

Re-designing Online Terminology Resources for German Grammar

Re-designing Online Terminology Resources for German Grammar Re-designing Online Terminology Resources for German Grammar Project Report Karolina Suchowolec, Christian Lang, and Roman Schneider Institut für Deutsche Sprache (IDS), Mannheim, Germany {suchowolec,

More information

Semantics Isn t Easy Thoughts on the Way Forward

Semantics Isn t Easy Thoughts on the Way Forward Semantics Isn t Easy Thoughts on the Way Forward NANCY IDE, VASSAR COLLEGE REBECCA PASSONNEAU, COLUMBIA UNIVERSITY COLLIN BAKER, ICSI/UC BERKELEY CHRISTIANE FELLBAUM, PRINCETON UNIVERSITY New York University

More information

MT+ Beneficiary Guide

MT+ Beneficiary Guide MT+ Beneficiary Guide Introduction... 2 How to get access... 3 Login... 4 Automatic notifications... 8 Menu and Navigation... 9 List functionalities... 12 Project Details... 18 How to manage organisations...

More information

Standards for language encoding: ISO

Standards for language encoding: ISO Standards for language encoding: ISO Tomaž Erjavec Dept. of Knowledge Technologies Jožef Stefan Institute ESSLLI 2011 Overview of the lecture 1. How ISO works 2. ISO TC 37 3. Dates, times & languages 4.

More information

Bulgarian Folk Songs in a Digital Library

Bulgarian Folk Songs in a Digital Library N. Kirov 1,2 L. Peycheva 3 1 Department Informatics, New Bulgarian University 2 Institute of Mathematics and Informatics, Bulgarian Academy of Sciences 3 Institute for Ethnology and Folklore Studies with

More information