Towards a roadmap for standardization in language technology


Towards a roadmap for standardization in language technology
Laurent Romary (Loria-INRIA) & Nancy Ide (Vassar College)

Overview
- General background on standardization
- Available standards
- On-going activities
- The work ahead of us

Standardization
- Defining methods or models to facilitate:
  - Exchange of data
  - Interoperability between software components
  - Comparability of results
- Involves:
  - From a technological point of view:
    - Stabilizing existing practices
    - Looking ahead for potential roadblocks
  - From an organizational point of view:
    - International consensus, long-term availability and maintenance
- Vertical vs. horizontal standardization

Standards: a complex picture
- Official standardization bodies:
  - National: AFNOR, ANSI, DIN, BSI, MSA
  - International: ISO, IEC, CEN, W3C, OASIS
- Specific fora: many! E.g.:
  - TEI (Text Encoding Initiative)
  - LISA (Localization Industry Standards Association)
- Projects with a pre-normative purpose, e.g. in the EU: EAGLES, Multext, MATE, ISLE

Existing standards (1)
- W3C (World Wide Web Consortium): horizontal standards
  - Basic building blocks: XML, XML Schema (note: growing importance of alternative RELAX NG schemas), XSL
  - Web services activity: WSDL, SOAP
  - Semantic Web activity: RDF, RDFS, OWL
- Specific (vertical) activities with little critical mass: VoiceXML, EMMA, etc.

Existing standards (2)
- Relevant standards in ISO (partial view)
- Basic infrastructural (horizontal) standards:
  - Character encoding (cf. IPA): ISO 10646/Unicode
  - Language codes: ISO 639 (e.g. "fr") and ISO 639-2 (e.g. "fra" / "fre")
    Note: under ISO/TC 37/SC 2
- Vertical standards:
  - MPEG-7 for multimedia information: hardly implementable :-(
  - Terminology standards: ISO 12200 (MARTIF), ISO 12620 (Data categories), ISO 16642 (Terminological Markup Framework)
    Note: under ISO/TC 37/SC 3
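
The difference between the two-letter and three-letter language codes mentioned above can be illustrated with a small lookup. This is only a sketch: the `LanguageCode` type, the `CODES` table and the `to_alpha3` helper are hypothetical names introduced here, and the table holds just two well-known entries rather than the full code lists. ISO 639-2 has both a terminologic and a bibliographic variant for some languages, which is why French appears as both "fra" and "fre".

```python
from typing import NamedTuple


class LanguageCode(NamedTuple):
    """One language with its two-letter and three-letter codes."""
    alpha2: str          # two-letter code, e.g. "fr"
    terminologic: str    # ISO 639-2 terminologic code, e.g. "fra"
    bibliographic: str   # ISO 639-2 bibliographic code, e.g. "fre"


# Illustrative excerpt only (assumption: a real system would load
# the complete code tables from the registration authority).
CODES = {
    "fr": LanguageCode("fr", "fra", "fre"),
    "de": LanguageCode("de", "deu", "ger"),
}


def to_alpha3(alpha2: str, bibliographic: bool = False) -> str:
    """Map a two-letter code to one of its three-letter codes."""
    entry = CODES[alpha2.lower()]
    return entry.bibliographic if bibliographic else entry.terminologic


print(to_alpha3("fr"))                      # terminologic variant
print(to_alpha3("fr", bibliographic=True))  # bibliographic variant
```

Resources that exchange data need to agree on which variant they use; a table like this makes the choice explicit instead of leaving it to each tool.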

Existing standards (3): looking at other fields
- ISO/IEC JTC 1/SC 36: education
  - Collaboration on language aspects
- ISO/IEC JTC 1/SC 32: databases
  - Strong basis provided by ISO 11179
- ISO/IEC JTC 1/SC ??: evaluation of software
  - ISO/IEC 9126-1 [parts 2 & 3 in progress]
  - ISO/IEC 14598-1 to 6

Existing standards (4)
- TEI proposals relevant for our field:
  - TEI header: seminal work, to evolve in collaboration with IMDI and OLAC
  - Basic representation of texts: prose, poetry, drama, etc.
  - Transcription of speech
  - Print dictionaries: under revision in collaboration with ISO/TC 37/SC 4 (cf. LMF)
  - Terminologies: under revision for compatibility with ISO 16642

ISO committee on language resources
- ISO/TC 37: Terminology and other language resources
  - SC 3: Computer applications in terminology
    - ISO 12200: MARTIF
      - Latest version of the TEI Terminology chapter
    - ISO 12620: Data categories (under revision)
    - ISO 16642: TMF (Terminological Markup Framework)
  - SC 4: Language Resource Management (May 2002)
    - Sec.: K.-S. Choi; Chair: L. Romary
    - http://www.tc37sc4.org

ISO/TC 37/SC 4 overall rationale
- Data categories
- WG1: Basic descriptors and mechanisms for language resources
- WG2: Representation schemes
- WG3: Multilingual text representation
- WG4: Lexical databases
- WG5: Workflow of language resource management

On-going activities within ISO/TC 37/SC 4 (1)
- Feature structure representation
  - Joint activity with the TEI; Committee Draft (CD) almost complete; planned project on FS declaration
- Linguistic Annotation Framework
  - E.g. principles of annotation scheme specification and representation, pointing mechanisms for stand-off markup; draft document available
- Morphosyntactic annotation framework
  - Stable working draft under dissemination for evaluation
- Lexical Markup Framework (LMF)
  - A general specification platform for lexical structures
  - Preliminary proposals: core model + lexical extensions
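
To make the idea of stand-off markup concrete: the annotations live apart from the primary text and point into it, here by character offsets, and each one carries a small feature structure. The following is a minimal sketch under that assumption, not the representation defined in the SC 4 drafts; the `Annotation` class, the `covered_text` helper, the sample sentence and the feature names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: a span of the primary text plus features."""
    start: int   # offset of the first character (inclusive)
    end: int     # offset just past the last character (exclusive)
    features: dict = field(default_factory=dict)  # flat feature structure


def covered_text(primary: str, ann: Annotation) -> str:
    """Resolve the pointer: return the stretch of text the annotation covers."""
    return primary[ann.start:ann.end]


# The primary text is never modified; annotations are kept separately.
PRIMARY = "Standards ease data exchange."

ANNOTATIONS = [
    Annotation(0, 9, {"pos": "noun", "number": "plural"}),
    Annotation(10, 14, {"pos": "verb", "tense": "present"}),
]

for ann in ANNOTATIONS:
    print(covered_text(PRIMARY, ann), ann.features)
```

Because the text itself stays untouched, several independent annotation layers (morphosyntax, syntax, discourse) can point at the same primary data without interfering with one another.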

On-going activities within ISO/TC 37/SC 4 (2)
- The central role of the Data Category Registry
  - Objective: a marketplace of descriptors for all types of language resources and annotation schemes
  - E.g.: /grammatical gender/, /paucal number/, /ablative case/, etc.
  - On-line tool available: http://syntax.loria.fr
- Three ad hoc groups created:
  - Metadata for language resources
    - Cf. TEI, IMDI, OLAC
  - Morphosyntactic descriptors (SC4 plenary last Tuesday)
    - Cf. Morphosyntactic Annotation Framework
  - Semantic content descriptors
    - Exploratory: discourse relations, dialogue acts, referential links, etc.
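
One way to picture the registry is as a shared table of descriptors against which any annotation scheme can be checked. The sketch below assumes exactly that and nothing more: `REGISTRY`, its one-line glosses, and the `unregistered` helper are hypothetical illustrations, not the structure or content of the actual on-line tool.

```python
# Hypothetical excerpt of a data category registry: each entry maps a
# category name to a short gloss (the glosses are illustrative wording).
REGISTRY = {
    "grammatical gender": "classification of nouns that triggers agreement",
    "paucal number": "number value denoting a few referents",
    "ablative case": "case typically marking source or separation",
}


def unregistered(scheme: dict, registry: dict = REGISTRY) -> list:
    """Return the categories used by a tagset that the registry lacks.

    `scheme` maps a project-specific tag to the data category it claims
    to realize; an empty result means every tag is anchored in the registry.
    """
    return sorted(cat for cat in scheme.values() if cat not in registry)


# Two hypothetical project tagsets using their own tag names.
scheme_a = {"GEN": "grammatical gender", "ABL": "ablative case"}
scheme_b = {"PAUC": "paucal number", "EVID": "evidentiality"}

print(unregistered(scheme_a))  # every tag is anchored
print(unregistered(scheme_b))  # reports the category missing from the excerpt
```

Two projects may use different tag names, yet remain comparable as long as their tags resolve to the same registered categories.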

Priorities for the future (1): stabilizing and disseminating
- Wide dissemination of existing standards
  - Two priorities in ISO/TC 37/SC 4: morphosyntax and lexical structures
- Validation of on-going documents by our community
  - Feedback on documents, reference implementations
  - Gathering samples and/or test suites (manpower needed)
- Organizing the work on the Data Category Registry
  - Which additional topics should be addressed?
  - How to involve a wide variety of experts?
- Specific publication and information days

Priorities for the future (2): filling in the gaps
- Syntactic structures: cf. treebanks, (chunk, deep) parsers
- Application-specific lexica
  - Which formats should be frozen within the LMF framework?
- Semantic content representation
  - Cf. ACL/SIGSEM working group on multimodal semantic content representation

Priorities for the future (3): open fields
- Multilingual information representation
  - How to relate on-going activities on translation memories, localization, itv, multimedia information (e.g. subtitling)?
- Evaluation of NLP components
  - General principles: linguistic coverage, metrics
  - Application-specific evaluation methods: machine translation, information extraction
- Workflow of language resources
  - The life cycle of language resources: creation, enrichment, validation, dissemination
- Sign languages

Conclusion
- Importance of dissemination of existing standards (in academia)
  - Standards as the identification of stable concepts in a field
  - Introduction in academic curricula
- Importance of wide involvement of experts (academia and industry)
  - Defining priorities
  - Contribution to technical work
- Linking main milestones in the roadmap with the underlying standardization efforts
  - E.g. evaluation-related standards