The IMP digital library of Slovene written cultural heritage
Tomaž Erjavec
Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
SEEDI 2013
Overview
1. Background
2. Scope of the library
3. Encoding and presentation
4. IMP corpus and lexicon
5. Conclusions
Background
1. AHLib project ( ): Deutsch-slowenische/kroatische Übersetzung, AAS / KFU (prof. Erich Prunč) + JSI
   - Slovene books translated from German
2. EU IP IMPACT ( ): Improving Access to Text, NUK (Alenka Kavčič Čolić, Ines Vodopivec) + JSI
   - Slovene publications (books and newspapers)
3. Google award ( ): Computational models for historical Slovene, ZRC SAZU (Matija Ogrin) + JSI
   - Samples of very old books
   - Funding for Wikisource (Miran Hladnik)
Goals
Project goals:
- AHLib: develop a corpus to study translation processes
- IMPACT: develop a hand-corrected corpus for improving OCR, and a lexicon giving modern equivalents of historical words for improving IR
- Google: develop computational models for historical Slovene, for better language technologies
So the digital library is actually a side effect.
IMP DL axioms:
- facsimiles + proof-read(!) texts
- uniformly encoded (TEI P5)
- no problems with further use or dissemination
IMP DL ( )
[Table: Units, Pages and Words per source (AHLib, KRN, NUK, WIKI, ZRC); figures not preserved]
[Charts: Units / source; Words / year]
Annotation on each unit
- Meta-data (teiHeader):
   - id, responsibility, extent, availability
   - basic bibliographic information (two titles: original, modern)
   - taxonomy: medium (manuscript, book, magazine, newspaper), text type (fiction, non-fiction, religious), text status (original, translated)
   - tag usage, revision description
- Facsimile: images in several sizes, each page break linked to its facsimile
- Text structure: divisions, headings, lists, tables, notes, poems, figures, line breaks
- Editorial interventions: sic/corr, foreign
TEI P5 encoding: source description
<sourceDesc>
  <bibl>
    <title type="main">Zlata Vas</title>
    <title type="alt">Zlata Vas</title>
    <author>Zschokke, Heinrich</author>
    <respStmt>
      <resp xml:lang="sl">prevajalec</resp>
      <resp xml:lang="en">translator</resp>
      <name>Malavašič, Fran</name>
    </respStmt>
    <date>1850</date>
    <publisher>natisnil Jožef Blaznik</publisher>
    <pubPlace>v Ljubljani</pubPlace>
    <extent>109+3</extent>
    <idno>ds54925 NUK - Narodna in univerzitetna knjižnica</idno>
    <note type="tradok" xml:lang="de">
      <ref target=" <lb/>1850<lb/>4260<lb/>[Zschokke, Heinrich] (Autor, erschlossen) Malavašič, Fran
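The bibliographic fields in such a source description can be read with a few lines of code. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` and a hand-made, well-formed fragment modelled on the slide; the function name `read_bibl` and the shape of the returned record are illustrative assumptions, not part of the IMP tooling.

```python
import xml.etree.ElementTree as ET

# TEI P5 documents live in this namespace; xml:lang is in the XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-made fragment modelled on the slide (assumption: the real header has more fields).
SAMPLE = """<sourceDesc xmlns="http://www.tei-c.org/ns/1.0">
  <bibl>
    <title type="main">Zlata Vas</title>
    <author>Zschokke, Heinrich</author>
    <respStmt>
      <resp xml:lang="sl">prevajalec</resp>
      <resp xml:lang="en">translator</resp>
      <name>Malavašič, Fran</name>
    </respStmt>
    <date>1850</date>
  </bibl>
</sourceDesc>"""


def read_bibl(xml_text):
    """Return title, author, date and English-labelled roles from a <sourceDesc>."""
    root = ET.fromstring(xml_text)
    bibl = root.find("tei:bibl", TEI_NS)
    record = {
        "title": bibl.findtext("tei:title[@type='main']", namespaces=TEI_NS),
        "author": bibl.findtext("tei:author", namespaces=TEI_NS),
        "date": bibl.findtext("tei:date", namespaces=TEI_NS),
    }
    roles = {}
    for stmt in bibl.findall("tei:respStmt", TEI_NS):
        name = stmt.findtext("tei:name", namespaces=TEI_NS)
        for resp in stmt.findall("tei:resp", TEI_NS):
            # Keep only the English role label; the Slovene one is a translation of it.
            if resp.get(XML_LANG) == "en":
                roles[resp.text] = name
    record["roles"] = roles
    return record


if __name__ == "__main__":
    print(read_bibl(SAMPLE))
```

Keeping the role labels keyed by language makes it straightforward to build the author, title and date indexes from the same record.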
TEI P5 encoding: text body
<pb n="[3]" facs="#fpg " xml:id="pb.003"/>
<div type="level1" xml:id="div.2">
  <head xml:id="head.2">1. <lb/>Kako Ožbalt iz vojske domú pride in kaj ljudjé govorijo.</head>
  <figure xml:id="figure.3">
    <figDesc>Ornamentna sličica. Okrašena črka V.</figDesc>
  </figure>
  <p xml:id="p.7">V nedéljo po poldne je bilo in v Zlati Vasi so mlajši fantini in dekleta pod staro lipo sedéli in peli, ali pa se smejali, kadar jo je kdó iz pivnice prilomil, ki je pregloboko v kozarček polukal. Nekteri kmetje s svojimi ženami so pa v gostivnici sedéli in pri bokalu prav židane volje bili, kakor je že to navada, kadar sta vino in vol po ceni.</p>
  <p xml:id="p.8">Kar jo primaha nék neznan človek v vas. Terdne in velike postave je bil in <choice> <sic>kakil</sic> <corr>kakih</corr> </choice> tridesét lét je mogel iméti; obléčen je bil v sivi suknji, na strani je imel veliko sabljo, na herbtu pa
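Because editorial corrections are encoded as `<choice>` pairs, both the printed (diplomatic) and the corrected reading can be produced from the same file. A minimal sketch, assuming a namespace-free, hand-completed paragraph based on the slide; the helper names `render` and `reading` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, hand-closed version of paragraph p.8 from the slide (no TEI namespace, for brevity).
SAMPLE = (
    '<p xml:id="p.8">Terdne in velike postave je bil in '
    "<choice> <sic>kakil</sic> <corr>kakih</corr> </choice> "
    "tridesét lét je mogel iméti;</p>"
)


def render(elem, prefer):
    """Concatenate the text of elem, dropping the non-preferred side of sic/corr."""
    skip = "sic" if prefer == "corr" else "corr"
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != skip:
            parts.append(render(child, prefer))
        # Text following a child belongs to the parent and is always kept.
        parts.append(child.tail or "")
    return "".join(parts)


def reading(xml_text, prefer="corr"):
    """Return the corrected ('corr') or as-printed ('sic') reading, whitespace-normalised."""
    text = render(ET.fromstring(xml_text), prefer)
    return " ".join(text.split())


if __name__ == "__main__":
    print(reading(SAMPLE, "sic"))
    print(reading(SAMPLE, "corr"))
```

The same traversal extends to other paired elements, provided the element to skip is chosen accordingly.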
DL on the Web
- We use (slightly modified) TEI XSLT stylesheets to convert TEI to HTML
- Each unit is one HTML file, showing both the facsimile and the (typeset) transcription
- Indexes to books are by taxonomy, sorted by (one of):
   - author
   - title
   - date
   - signature
- One index also shows title pages
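The slides state that the HTML is produced with TEI XSLT stylesheets; the sketch below is not that pipeline. It only illustrates, in Python, the idea of an index grouped by taxonomy category and sorted by one chosen field. Apart from the first record, which repeats the slide's example, the records and the `build_index` helper are made-up placeholders.

```python
from itertools import groupby
from operator import itemgetter

# Illustrative unit records; only the first reflects the slide, the rest are placeholders.
UNITS = [
    {"author": "Zschokke, Heinrich", "title": "Zlata Vas", "date": 1850, "type": "fiction"},
    {"author": "Author B", "title": "Title B", "date": 1790, "type": "religious"},
    {"author": "Author A", "title": "Title C", "date": 1880, "type": "fiction"},
]


def build_index(units, sort_key):
    """Group units by text type, ordering each group by sort_key (author, title or date)."""
    # groupby needs its input sorted by the grouping key, so sort by (type, sort_key).
    ordered = sorted(units, key=itemgetter("type", sort_key))
    return {
        text_type: [unit["title"] for unit in group]
        for text_type, group in groupby(ordered, key=itemgetter("type"))
    }


if __name__ == "__main__":
    print(build_index(UNITS, "date"))
    print(build_index(UNITS, "author"))
```

Plain string sorting is used here; a production index for Slovene would presumably need locale-aware collation.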
Example book: front
Example book: body
Example index
Example index
DL as corpus
We have also developed a tool to automatically:
1. Tokenise the text (split it into words, punctuation and whitespace)
2. Modernise the words in the text
3. Tag the words with morphosyntactic descriptions
4. Lemmatise the words
Example TEI encoding: Ako se ne združimo tiga[tega] kužniga[kužnega]...
<s>
  <w lemma="ako" ana="cs">Ako</w><c> </c>
  <w lemma="se" ana="px------y">se</w><c> </c>
  <w lemma="ne" ana="q">ne</w><c> </c>
  <w lemma="združiti" ana="vmer1p">združimo</w><c> </c>
  <choice>
    <orig><w>tiga</w></orig>
    <reg><w lemma="ta" ana="pd-msg">tega</w></reg>
  </choice>
  <c> </c>
  <choice>
    <orig><w>kužniga</w></orig>
    <reg type="pattern" n="[ega@ iga@]"><w lemma="kužen" ana="agpmsg">kužnega</w></reg>
  </choice>
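An encoding of this kind lets the original and the modernised sentence be recovered side by side, together with the lemmas. A minimal sketch, assuming a namespace-free copy of the slide's fragment that has been closed with `</s>` by hand; the function name `readings` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, closed with </s> by hand so that it parses (assumption).
SAMPLE = """<s>
<w lemma="ako" ana="cs">Ako</w><c> </c>
<w lemma="se" ana="px------y">se</w><c> </c>
<w lemma="ne" ana="q">ne</w><c> </c>
<w lemma="združiti" ana="vmer1p">združimo</w><c> </c>
<choice><orig><w>tiga</w></orig><reg><w lemma="ta" ana="pd-msg">tega</w></reg></choice>
<c> </c>
<choice><orig><w>kužniga</w></orig>
<reg type="pattern" n="[ega@ iga@]"><w lemma="kužen" ana="agpmsg">kužnega</w></reg></choice>
</s>"""


def readings(sentence_xml):
    """Return (original text, modernised text, lemmas) for one annotated <s>."""
    sentence = ET.fromstring(sentence_xml)
    original, modern, lemmas = [], [], []
    for node in sentence:
        if node.tag == "choice":
            # <orig> holds the historical form, <reg> the annotated modern form.
            orig_w = node.find("orig/w")
            reg_w = node.find("reg/w")
            original.append(orig_w.text)
            modern.append(reg_w.text)
            lemmas.append(reg_w.get("lemma"))
        elif node.tag == "w":
            original.append(node.text)
            modern.append(node.text)
            lemmas.append(node.get("lemma"))
        elif node.tag == "c":
            # <c> carries the whitespace between tokens.
            original.append(node.text)
            modern.append(node.text)
    return "".join(original), "".join(modern), lemmas


if __name__ == "__main__":
    for item in readings(SAMPLE):
        print(item)
```

Only element content is read, so the line breaks used to lay out the sample do not leak into the reconstructed text.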
Concordancers
The linguistically analysed corpus is made available on the web via two concordancers:
- NoSketch Engine: the open-source version of the popular (and commercial) Sketch Engine
- CUWI: our front-end to the well-known IMS CWB corpus workbench
The concordancers offer:
- a powerful search query syntax (regular expressions over words and annotations)
- filters over meta-data (text types, year of publication, author, ...)
- various sorting options over concordances
- construction of frequency lexica
- collocations
- saving results, etc.
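To make the idea of regular expressions over words and annotations concrete, here is a toy keyword-in-context search and frequency count over the annotated tokens from the earlier example sentence. It is a sketch only: it does not use the query syntax of NoSketch Engine or CWB, and the names `kwic` and `lemma_frequencies` are hypothetical.

```python
import re
from collections import Counter

# (modernised word, lemma, morphosyntactic tag) triples taken from the example sentence.
TOKENS = [
    ("Ako", "ako", "cs"),
    ("se", "se", "px------y"),
    ("ne", "ne", "q"),
    ("združimo", "združiti", "vmer1p"),
    ("tega", "ta", "pd-msg"),
    ("kužnega", "kužen", "agpmsg"),
]


def kwic(tokens, word_re=".*", msd_re=".*", width=2):
    """Return (left context, hit, right context) for tokens matching both patterns."""
    hits = []
    for i, (word, _lemma, msd) in enumerate(tokens):
        if re.fullmatch(word_re, word) and re.fullmatch(msd_re, msd):
            left = " ".join(w for w, _, _ in tokens[max(0, i - width):i])
            right = " ".join(w for w, _, _ in tokens[i + 1:i + 1 + width])
            hits.append((left, word, right))
    return hits


def lemma_frequencies(tokens):
    """Build a simple frequency lexicon keyed by lemma."""
    return Counter(lemma for _word, lemma, _msd in tokens)


if __name__ == "__main__":
    print(kwic(TOKENS, word_re=r".*ega"))  # words ending in -ega
    print(kwic(TOKENS, msd_re=r"v.*"))     # tokens whose tag starts with "v"
    print(lemma_frequencies(TOKENS).most_common(3))
```

A real concordancer indexes the corpus in advance, so queries of this kind stay fast on millions of tokens; the linear scan here is only for illustration.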
NoSketch Engine
CUWI
More than just concordances
Other IMP language resources
1. goo300k gold-standard corpus
   - words, pages
   - page-sampled from IMP DL
   - manually annotated
2. IMP lexicon
   - made on the basis of hand-annotated corpus examples
   - also encoded in TEI P5
   - for browsing on the Web (HTML)
   - as a data source for HLT applications
3. ToTrTaLe
   - the tool to linguistically analyse historical Slovene texts
   - TEI P5 I/O
   - utilises the IMP lexicon and transcription rules
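The combination of a lexicon with transcription rules can be pictured as a lookup with a rule-based fallback. The sketch below is not ToTrTaLe's actual algorithm: the single lexicon entry and the single suffix rule are toy assumptions derived from the slide's example, where `tiga` is regularised to `tega` and `kužniga` carries `type="pattern"` with `[ega@ iga@]`, read here as "word-final iga becomes ega".

```python
import re

# Toy data (assumptions): one lexicon entry and one suffix rule from the slide's example.
LEXICON = {"tiga": "tega"}               # historical form -> modern form
RULES = [(re.compile(r"iga$"), "ega")]   # assumed reading of the pattern "[ega@ iga@]"


def modernise(word):
    """Return (modern form, source), where source is 'lexicon', 'pattern' or 'unchanged'."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key], "lexicon"
    for pattern, replacement in RULES:
        if pattern.search(key):
            return pattern.sub(replacement, key), "pattern"
    return word, "unchanged"


if __name__ == "__main__":
    for form in ["tiga", "kužniga", "združimo"]:
        print(form, modernise(form))
```

Trying the lexicon first keeps attested forms authoritative, with rules covering only unseen words. Note that this sketch lowercases any word it rewrites.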
IMP lexicon on the Web
Size of IMP lexicon
[Table: counts of lemmas, modern forms and historical forms for each size class; figures not preserved]
Size: what we count
- XL: everything
- L: words
- M: historical forms
- S: archaic words
- XS: word boundaries
Conclusions
Presented the IMP DL, corpus, and lexicon of historical Slovene, available at [...]:
- for teaching / investigating Slovene history
- for diachronic linguistic investigations
- for development of HLT
Further work
- Currently: final clean-up of data
- Next: re-train ToTrTaLe, re-annotate the corpus
- Offer more output formats: epub, PDF
- Extend beyond 1918 (Wikisource)
- HLT experiments: transcription rules, MT, ...