STS Infrastructural considerations. Christian Chiarcos

Similar documents
A Model-driven approach to NLP programming with UIMA

Importing MASC into the ANNIS linguistic database: A case study of mapping GrAF

Building the Multilingual Web of Data. Integrating NLP with Linked Data and RDF using the NLP Interchange Format

Reducing Consumer Uncertainty

Language Resources and Linked Data

TODO. LLOD and corpora

UIMA-based Annotation Type System for a Text Mining Architecture

MDA & Semantic Web Services Integrating SWSF & OWL with ODM

Knowledge-Driven Video Information Retrieval with LOD

Customisable Curation Workflows in Argo

CSc 8711 Report: OWL API

Text Mining for Software Engineering

Semantic Web for Earth and Environmental Terminology (SWEET) Status, Future Development and Community Building

Chapter 4 Research Prototype

ANC2Go: A Web Application for Customized Corpus Creation

Linked Open Data Cloud. John P. McCrae, Thierry Declerck

Towards an Integrated Information Framework for Service Technicians

Ontology Summit2007 Survey Response Analysis. Ken Baclawski Northeastern University

CHAPTER 5 SEARCH ENGINE USING SEMANTIC CONCEPTS

Chapter 13: Advanced topic 3 Web 3.0

CHAPTER 1 INTRODUCTION

Data is the new Oil (Ann Winblad)

THE GETTY VOCABULARIES TECHNICAL UPDATE

clarin:el an infrastructure for documenting, sharing and processing language data

Korean NLP2RDF Resources

Information Retrieval (IR) through Semantic Web (SW): An Overview

Implementing a Variety of Linguistic Annotations

Linked.Art & Vocabularies: Linked Open Usable Data

Ontology-based Architecture Documentation Approach

Semantic Data Extraction for B2B Integration

A Semantic Web-Based Approach for Harvesting Multilingual Textual. definitions from Wikipedia to support ICD-11 revision

Presented By Aditya R Joshi Neha Purohit

Semantic Web Company. PoolParty - Server. PoolParty - Technical White Paper.

Reducing Consumer Uncertainty Towards a Vocabulary for User-centric Geospatial Metadata

Semantic agents for location-aware service provisioning in mobile networks

BUILDING THE SEMANTIC WEB

Semantic Technologies and CDISC Standards. Frederik Malfait, Information Architect, IMOS Consulting Scott Bahlavooni, Independent

The Semantic Web DEFINITIONS & APPLICATIONS

Migrating LINA Laboratory to Apache UIMA

The MEG Metadata Schemas Registry Schemas and Ontologies: building a Semantic Infrastructure for GRIDs and digital libraries Edinburgh, 16 May 2003

SEMANTIC WEB DATA MANAGEMENT. from Web 1.0 to Web 3.0

Languages and tools for building and using ontologies. Simon Jupp, James Malone

Annotation Science From Theory to Practice and Use Introduction A bit of history

An UIMA based Tool Suite for Semantic Text Processing

Knowledge Engineering with Semantic Web Technologies

From Open Data to Data- Intensive Science through CERIF

ELENA: Creating a Smart Space for Learning. Zoltán Miklós (presenter) Bernd Simon Vienna University of Economics

The NEPOMUK project. Dr. Ansgar Bernardi DFKI GmbH Kaiserslautern, Germany

Terminologies, Knowledge Organization Systems, Ontologies

Proposal for Implementing Linked Open Data on Libraries Catalogue

Towards the Semantic Desktop. Dr. Øyvind Hanssen University Library of Tromsø

Chapter 13 XML: Extensible Markup Language

COMP9321 Web Application Engineering

Making UIMA Truly Interoperable with SPARQL

Sustainability of Text-Technological Resources

COMP9321 Web Application Engineering

DBpedia Data Processing and Integration Tasks in UnifiedViews

A generic approach to manage metadata standards

case study The Asset Description Metadata Schema (ADMS) A common vocabulary to publish semantic interoperability assets on the Web July 2011

CSC 5930/9010: Text Mining GATE Developer Overview

What you have learned so far. Interoperability. Ontology heterogeneity. Being serious about the semantic web

Outline. 1 Introduction. 2 Semantic Assistants: NLP Web Services. 3 NLP for the Masses: Desktop Plug-Ins. 4 Conclusions. Why?

WHY WE NEED AN XML STANDARD FOR REPRESENTING BUSINESS RULES. Introduction. Production rules. Christian de Sainte Marie ILOG

COMPUTER AND INFORMATION SCIENCE JENA DB. Group Abhishek Kumar Harshvardhan Singh Abhisek Mohanty Suhas Tumkur Chandrashekhara

Ontology mutation testing

Lecture Telecooperation. D. Fensel Leopold-Franzens- Universität Innsbruck

Enrichment of Sensor Descriptions and Measurements Using Semantic Technologies. Student: Alexandra Moraru Mentor: Prof. Dr.

Metadata harmonization for fun and profit

Domain Specific Semantic Web Search Engine

Toward a Knowledge-Based Solution for Information Discovery in Complex and Dynamic Domains

Web Semantic Annotation Using Data-Extraction Ontologies

Taxonomy Tools: Collaboration, Creation & Integration. Dow Jones & Company

Semantics Isn t Easy Thoughts on the Way Forward

Linked Open Data: a short introduction

Instance linking. Leif Granholm Trimble Senior Vice President BIM Ambassador

Natural Language Processing with PoolParty

ISA Action 1.17: A Reusable INSPIRE Reference Platform (ARE3NA)

The Model-Driven Semantic Web Emerging Standards & Technologies

SEMANTIC SUPPORT FOR MEDICAL IMAGE SEARCH AND RETRIEVAL

DBpedia-An Advancement Towards Content Extraction From Wikipedia

A tutorial report for SENG Agent Based Software Engineering. Course Instructor: Dr. Behrouz H. Far. XML Tutorial.

Bridging the Gap between Semantic Web and Networked Sensors: A Position Paper

Linked Data and RDF. COMP60421 Sean Bechhofer

Introduction. October 5, Petr Křemen Introduction October 5, / 31

1 Copyright 2011, Oracle and/or its affiliates. All rights reserved.

An e-infrastructure for Language Documentation on the Web

Parmenides. Semi-automatic. Ontology. construction and maintenance. Ontology. Document convertor/basic processing. Linguistic. Background knowledge

Racer: An OWL Reasoning Agent for the Semantic Web

Using ontologies function management

Ontology Extraction from Heterogeneous Documents

A SEMANTIC MATCHMAKER SERVICE ON THE GRID

<is web> Information Systems & Semantic Web University of Koblenz Landau, Germany

ATC An OSGI-based Semantic Information Broker for Smart Environments. Paolo Azzoni Research Project Manager

Linked Data: What Now? Maine Library Association 2017

Semantic Web Update W3C RDF, OWL Standards, Development and Applications. Dave Beckett

Utilizing, creating and publishing Linked Open Data with the Thesaurus Management Tool PoolParty

OWL-DBC The Arrival of Scalable and Tractable OWL Reasoning for Enterprise Knowledge Bases

OKKAM-based instance level integration

Semantic Web Fundamentals

Background and Context for CLASP. Nancy Ide, Vassar College

Transcription:

STS Infrastructural considerations Christian Chiarcos chiarcos@uni-potsdam.de

Infrastructure Requirements Candidates standoff-based architecture (Stede et al. 2006, 2010) UiMA (Ferrucci and Lally 2004) RDF-based architecture (Hellmann 2010, Hellmann et al. 2012) Comparison

Requirements Flexibility support all necessary data structures, hierarchical, and relational Interoperability structural ( syntactic ) common exchange format for all modules conceptual ( semantic ) well-defined data categories clearly specified means to address them

Requirements Availability Can we build upon an existing architecture? Web Services Semantic modules using large knowledge bases should operate on their own servers Efficient interchange format Easy to parse, merge and write Performance

1. Standoff-based architecture e.g., SuMMAR/MOTS (Stede et al. 2006, 2010) pipeline architecture for high-quality text summarization syntax, coreference, text structure, causal markers, etc. standoff output of different modules to be combined these may also run in parallel exchange format PAULA standoff XML, derived from early (2004) drafts for the LAF

1. Architecture Merging Syntactical Analysis (Connexor) Structure Weight Calculation Discourse Marker Annotation Summary Calculation Layout Structure and Metadata Extraction Text Structure Extraction Coreference Analysis (Rosana) Term Weight Calculation Topic Segmentation Treetagger Number and Time Annotation Graphical Representation Final Modules Flexible Modules Tokenization and Sentence Boundary Detection Preprocessing Modules flexible modules can be arranged in any order in the pipeline or be processed nonsequentially standoff XML as common interchange format

1. Summarization pipeline Coreference Analysis (Rosana) Layout Structure and Metadata Extraction Syntactic Analysis (Connexor) Graphical Representation Text Structure Extraction Robust Morphosyntactic Analysis (TreeTagger) Summary Calculation Tokenization and Sentence Boundary Detection Term Weight Calculation Merging Preprocessing Modules Topic Segmentation Flexible Modules (selection) Final Modules

1. A fragment Coreference Analysis (Rosana) Layout Structure and Metadata Extraction Syntactic Analysis (Connexor)??? Graphical Representation Transforming Rosana output to PAULA Text Structure Extraction Robust Morphosyntactic Analysis (TreeTagger) Transforming relevant PAULA annotations to Connexor input format PAULA Summary Calculation Tokenization and Sentence PAULA Boundary Detection Preprocessing Modules Term Weight Calculation Topic Segmentation Flexible Modules Merging Merging multiple annotation layers in one PAULA project Final Modules one single PAULA project comprising annotations from different modules

1. Standoff XML advantages modularization trivial merge and split operations for annotations of the same document add another file to the annotation project clear conceptual separation of annotations disadvantages modules exchange information through XML relatively slow

2. UiMA (Ferruci and Lallas 2004) Unstructured Information Management Architecture Industry-scale architecture for NLP pipelines active community, good support Relatively generic data model with different realizations JAVA Objects, XML, others

2. UiMA Wrappers for various NLP tools available input and output representations of modules ( CAS consumers ) defined by annotation types e.g., a part-of-speech tag inventory different annotation type systems may not be compatible with each other => limited interoperability

2. UiMA advantages maturity rich technological ecosystem, active community efficiency supports, e.g., information exchange through JAVA objects disadvantages limited interoperability only how to implement a distributed architecture?

2. UiMA extensions Egner et al. (2007) UiMA Grid, distributed large-scale text analysis Verspoor et al. (2009) Abstracting the types away from a UiMA type system Ontologies instead of annotation types improved conceptual (`semantic ) interoperability less efficient indexing These extensions would have to be reimplemented for an STS pipeline AFAIK, not publicly available

3. RDF-based architecture Hellmann (2010), Hellmann et al. (2012) NLP Interchange Format (NIF) http://nlp2rdf.org/nif-1-0 NLP2RDF: RDF wrappers for various tools http://nlp2rdf.org provides NLP analyses for processing with Semantic Web tools applied in a large-scale European research project (LOD2) adopted by several external research groups

3. RDF Resource Description Framework W3C standard formalizes labeled directed multigraphs (like XML standoff formats) sublanguages define specialized vocabularies RDF Schema: concept hierarchies SKOS: semi-structured terminology bases OWL: ontologies

3. RDF different linearizations XML (verbose), Turtle (compact), others rich technological ecosystem data bases ( triple stores ) APIs and (syntactic) validators query language SPARQL OWL/DL despription logics defining and checking constraints (axioms) => formally defined user-specific data types

3. NLP2RDF

3. RDF advantages rich ecosystem, large and active community native support for distributed processing direct integration with LOD resources may be relevant for STS conceptual interoperability through linking with terminology repositories

Comparison standoff XML UiMA NLP2RDF flexibility + (+) + flexibility: + support for all necessary data structures (+) UiMA: multiple ways to represent trees

Comparison standoff XML UiMA NLP2RDF flexibility + + + structural interoperability + (+) + structural ( syntactic ) interoperability: + same format for all modules (+) UiMA: multiple ways to define trees

Comparison standoff XML UiMA NLP2RDF flexibility + + + structural interoperability conceptual interoperability + (+) + (-) (+) + conceptual ( semantic ) interoperability: + interoperability through reference to a terminology repository (+) UiMA: interoperability if the same annotation type system is used (-) standoff: links to terminology repositories can be provided, but no standard has been established to do so

Comparison standoff XML UiMA NLP2RDF flexibility + + + structural interoperability conceptual interoperability + (+) + (-) (+) + availability - (SuMMAR) + + availability: - unknown/restricted licence + open license

Comparison standoff XML UiMA NLP2RDF flexibility + + + structural interoperability conceptual interoperability + (+) + (-) (+) + availability - (SuMMAR) + + maturity (-) ++ + maturity: ++ industry-scale + used in multiple research groups (-) used in one research group

Comparison standoff XML UiMA NLP2RDF flexibility + + + structural interoperability conceptual interoperability + (+) + (-) (+) + availability - (SuMMAR) + + maturity (-) ++ + web services (+) (+) + support for distributed processing (web services): + available (+) possible

Comparison standoff XML UiMA NLP2RDF flexibility + + + structural interoperability conceptual interoperability + (+) + (-) (+) + availability - (SuMMAR) + + maturity (-) ++ + web services (+) (+) + performance/ efficiency - +/(+) (+) performance/efficiency + direct exchange of objects (without serialization) possible (+) compact serialization - verbose serialization

Todo: Rank criteria standoff XML UiMA NLP2RDF flexibility + + + structural interoperability conceptual interoperability + (+) + (-) (+) + availability - (SuMMAR) + + maturity (-) ++ + web services (+) (+) + performance/ efficiency - +/(+) (+) Which to chose? Combination of multiple architectures?