Formats and standards for metadata, coding and tagging. Paul Meurer

Similar documents
Background and Context for CLASP. Nancy Ide, Vassar College

Unit 3 Corpus markup

Implementing a Variety of Linguistic Annotations

Best practices in the design, creation and dissemination of speech corpora at The Language Archive

(Some) Standards in the Humanities. Sebastian Drude CLARIN ERIC RDA 4 th Plenary, Amsterdam September 2014

clarin:el an infrastructure for documenting, sharing and processing language data

CLARIN for Linguists Portal & Searching for Resources. Jan Odijk LOT Summerschool Nijmegen,

This document is a preview generated by EVS

Standards for Language Resources

XML Support for Annotated Language Resources

WebAnno: a flexible, web-based annotation tool for CLARIN

Janne Bondi johannessen, Anders Nøklestad, Joel Priestley and Kristin Hagen. WP5: Glossa Integration

UIMA-based Annotation Type System for a Text Mining Architecture

Natural Language Processing Pipelines to Annotate BioC Collections with an Application to the NCBI Disease Corpus

Ortolang Tools : MarsaTag

Standards for Language Resources

1 Overview chart. PIDs: talk with EPIC PIDs: MoU or advice. Assessment wave 3. VLO overhaul CMDI 1.2

Final Project Discussion. Adam Meyers Montclair State University

Annotation by category - ELAN and ISO DCR

TectoMT: Modular NLP Framework

CORLI. a linguistic consortium for corpus, language and interaction

A tool for Cross-Language Pair Annotations: CLPA

Annotation Science From Theory to Practice and Use Introduction A bit of history

STS Infrastructural considerations. Christian Chiarcos

EUDAT B2FIND A Cross-Discipline Metadata Service and Discovery Portal

Converting the Corpus Query Language to the Natural Language

Towards a roadmap for standardization in language technology

The challenge of collecting and evaluating LRs for commercial use

The American National Corpus First Release

CMDI and granularity

Question Answering Using XML-Tagged Documents

Maca a configurable tool to integrate Polish morphological data. Adam Radziszewski Tomasz Śniatowski Wrocław University of Technology

CSC 5930/9010: Text Mining GATE Developer Overview

How FAIR am I? FAIR Principles and Interoperability of Data and Tools

A Short Introduction to CATMA

Implementation of the Data Seal of Approval

Introducing XAIRA. Lou Burnard Tony Dodd. An XML aware tool for corpus indexing and searching. Research Technology Services, OUCS

Modeling Linguistic Research Data for a Repository for Historical Corpora. Carolin Odebrecht Humboldt-Universität zu Berlin LAUDATIO-repository.

An evaluation of three POS taggers for the tagging of the Tswana Learner English Corpus

Multi-level XML-based Corpus Annotation

Annotation and Feature Engineering

Intro to XML. Borrowed, with author s permission, from:

Representation and Interchange of Linguistic Annotation An In-Depth, Side-by-Side Comparison of Three Designs

Text Corpus Format (Version 0.4) Helmut Schmid SfS Tübingen / IMS StuDgart

Introduction to Lexical Functional Grammar. Wellformedness conditions on f- structures. Constraints on f-structures

EuroParl-UdS: Preserving and Extending Metadata in Parliamentary Debates

ANC2Go: A Web Application for Customized Corpus Creation

Converting and Representing Social Media Corpora into TEI: Schema and Best Practices from CLARIN-D

Registry Interchange Format: Collections and Services (RIF-CS) explained

Conversion and Annotation Web Services for Spoken Language Data in CLARIN

RDF and Digital Libraries

A Tag For Punctuation

Information Extraction Techniques in Terrorism Surveillance

A Multilingual Social Media Linguistic Corpus

structure of the presentation Frame Semantics knowledge-representation in larger-scale structures the concept of frame

Specifying Syntax. An English Grammar. Components of a Grammar. Language Specification. Types of Grammars. 1. Terminal symbols or terminals, Σ

NLP Chain. Giuseppe Castellucci Web Mining & Retrieval a.a. 2013/2014

@Note2 tutorial. Hugo Costa Ruben Rodrigues Miguel Rocha

1. General requirements

A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet

Story Workbench Quickstart Guide Version 1.2.0

Component Metadata Infrastructure Best Practices for CLARIN

Treex: Modular NLP Framework

NLP Final Project Fall 2015, Due Friday, December 18

Textual Emigration Analysis

ISO-standard Metadata Descriptors and Registries

Semantics Isn t Easy Thoughts on the Way Forward

Fully Delexicalized Contexts for Syntax-Based Word Embeddings

Constraints for corpora development and validation 1

META-SHARE metadata: Overview of the schema & Interoperability with other schemas

Working towards a Metadata Federation of CLARIN and DARIAH-DE

Advanced Topics in Information Retrieval Natural Language Processing for IR & IR Evaluation. ATIR April 28, 2016

Error annotation in adjective noun (AN) combinations

INF FALL NATURAL LANGUAGE PROCESSING. Jan Tore Lønning, Lecture 4, 10.9

EUDAT-B2FIND A FAIR and Interdisciplinary Discovery Portal for Research Data

Using metadata for interoperability. CS 431 February 28, 2007 Carl Lagoze Cornell University

Automatic Metadata Extraction for Archival Description and Access

ACDH AUSTRIAN CENTRE FOR DIGITAL HUMANITIES

Introduction to Lexical Analysis

CLARIN Concept Registry: The New Semantic Registry

DuraSpace FAIRness and GDPR

Corpus Linguistics: corpus annotation

FLAT: A CLARIN-compatible repository solution based on Fedora Commons

Metadata and Encoding Standards for Digital Initiatives: An Introduction

Sustainability of Text-Technological Resources

Contents. List of Figures. List of Tables. Acknowledgements

Encoding standards for large text resources: The Text Encoding Initiative

Just for the record, CMDI should be about semantic interoperability

Assessing the FAIRness of Datasets in Trustworthy Digital Repositories: a 5 star scale

Encoding and Designing for the Swift Poems Project

Argument Structures and Semantic Roles: Actual State in ISO TC37/SC4 TDG 3

Natural Language Processing: Programming Project II PoS tagging

Standards for language encoding: Sharing resources

This document is a preliminary proposal to encode two characters into Unicode.

Building metadata components

NLP AND ONTOLOGY MATCHING: A SUCCESSFUL COMBINATION FOR TRIALOGICAL LEARNING

Computer Science Applications to Cultural Heritage. Metadata

FAIR-aligned Scientific Repositories: Essential Infrastructure for Open and FAIR Data

How to.. What is the point of it?

TTNWW. TST tools voor het Nederlands als Webservices in een Workflow. Marc Kemps-Snijders

Transcription:

Formats and standards for metadata, coding and tagging Paul Meurer

The FAIR principles FAIR principles for resources (data and metadata): Findable (-> persistent identifier, metadata, registered/indexed) Accessible (-> retrievable by pid, standardized protocol; metadata always accessible) Interoperable (use formal, accessible, shared, broadly applicable language for knowledge representation; vocabularies following FAIR principles) Reusable (metadata have accurate and relevant attributes; clear license; information on provenance; metadata meet domain relevant community standards)

CLARIN: CMDI metadata CMDI (Component MetaData Infrastructure): XML-based metadata format (ISO standard) to describe a resource and make it findable CMDI templates to choose from Persistent identifiers Standardized hierarchical description Makes resource findable through e.g. VLO (Virtual Language Observatory)

L2 corpora, ASK L2 corpora: Particularly important to link text to information about the learner and the text production task context -> encoding of core (learner and task) metadata Can be achieved by e.g. using XML/TEI. XML: machine-independent text-based coding standard, much used to encode corpus data TEI: XML-based guidelines for structural encoding with meaningful (semantic) markup

Encoding format: TEI TEI header: encoding of metadata Person- and task-related metadata is encoded in profiledesc/particdesc/person, as a list of <p> elements: <profiledesc> <particdesc> <person> <p id="01" n="pid">h0004</p> <p id="02" n="testyear">2000</p> <p id="03" n="testlevel">høyere nivå</p> <p id="04" n="country">nederland</p> <p id="05" n="language">nederlandsk</p> <p id="06" n="age">45</p> <p id="07" n="gender">kvinne</p>

Encoding: possible improvements Should stick to English or multilingual attributes and values Use a restricted vocabulary for atts and values either agreed-on in the L2 community, or based on some public external vocabulary like OpenSKOS Using <p> elements to encode flat attribute-value pairs seems OK, since TEI does not provide structured elements that fit our purpose. They are also straight-forward to feed into corpus applications But distinguishing person, task-related and bookkeeping metadata could be advantageous Also attributes that are constant for a given corpus/ project should be encoded

Error encoding in ASK Encoding device: <sic> elements, with attributes type, desc and corr. <sic type="f" desc="agr" corr="stilling">stillingen</sic> <sic> elements can nest: <sic type="o">før var kvinner <sic type="infl">undertrukket</sic> </sic> Improvements: Better attribute names ( desc means subtype) Different error classification/granularity?

Grammatical annotation ASK: Text is tokenized and split into sentences Morphosyntactic annotation with POS, morphological features, syntactic relations (Obs.: morphosyntactic annotation is not always accurate, is based on corrected version; it should only be used to guide your searches) <word lemma="dette" features="pron nøyt ent pers @subj"> dette </word>

Grammatical annotation: improvements ASK: Bag of tags, Norwegian names for grammatical features (taken from Oslo-Bergen Tagger) Other corpora: Šolar: array of characters, each position encoding one feature (e.g., Sometn ). Cryptic. Teitok: POS only? But: Better interoperability obtained using standardized tags Possible choices: Eagles tagset (outdated?) Universal dependency POS, feature set and relations e.g., VERB tense=past Some bag of tags solution Advantage of bag of tags: easy to use in corpus tool (but less selfexplanatory)

ASK: Addressing of text positions Sentence encoding: <s sid="h0004:s19"> Document id (pid), sentence id (sid) and word position in sentence are used as reference points to uniquely address a word in a text. Address stays the same also when grammatical or error coding is changed. Important for user-generated corpus annotation when corpus is reindexed

User-generated annotation Annotate phenomena that are not coded in the corpus and cannot be automatically searched for Two variants: Free-text annotation Annotation with a restricted vocabulary (feature-value pairs) Important: ability to search for annotations

Annotation: use case Annotate stranded prepositions (simplified) Devise an annotation feature stranded with two values: yes, no Search for prepositions followed by comma or full stop Go through list in KWIC and classify Search like [feature=("stranded:yes")] etc.