Semantics Isn't Easy: Thoughts on the Way Forward

Nancy Ide, Vassar College
Rebecca Passonneau, Columbia University
Collin Baker, ICSI/UC Berkeley
Christiane Fellbaum, Princeton University
New York University, November 14-15, 2008

Approaches to Semantic Knowledge Acquisition
- Semi/un-supervised statistical methods to gather facts from unannotated corpora
- Smaller, (semi-)manual, detailed analyses using annotated resources
- Both approaches are essential, and are best leveraged if performed in close coordination

Semi/un-supervised learning: Advantages
- Few resources required (inexpensive)
  - Although it may implicitly use throw-away annotations
  - Implicitly reliant on a lot of knowledge
- Fast
  - Can rapidly amass volumes of information
- Web provides vast amounts of language data
  - Size overcomes noise?
- Reliability is a concern
  - Results skewed by different varieties and genres, different types of speakers
  - E.g., British vs. American English split in the frequency of a phenomenon
  - E.g., inclusion of vast amounts of incorrect language or persistent non-native speaker errors: unreliable information for ESL?

Semi/un-supervised learning: Limitations
- Assumes a (single) set of semantic facts (meanings, relations, etc.) that
  - is stable and discoverable
  - humans can agree on (90% of the time?)
- Web data unreliable
- Distributional features that are fundamentally unknown
  - Left with a linguistic black box
  - Variations due to genre, dialect, situation, context, time of production, etc. are conflated
- Cannot capture fluid, dynamic, generative aspects of word and phrasal meaning
  - We have not explored means to represent and process language that take this into account

Analyses relying on annotated resources: Advantages
- Overcome the disadvantages of unsupervised / unannotated approaches
  - Can get at the more fluid and dynamic aspects of language
  - Can examine the impact of genre, situation, context, dialect, etc. by controlling corpus content
- Heavily annotated data enables exploration of interrelations among linguistic layers
  - A critical next step for NLP research

Analyses relying on annotated resources: Disadvantages
- Expensive!
  - Costly to manually or even semi-manually produce reliable language data and annotations
- Slow
  - Manual work takes time

MASC: Manually Annotated Sub-Corpus
- NSF-funded project to provide a sharable, reusable annotated resource with rich linguistic annotations
- Texts from a wide range of genres
- Manual annotations or manually validated annotations for multiple levels
  - WordNet senses
  - FrameNet frames and frame elements
  - Shallow parses
  - Named entities
- Enables linking WordNet senses and FrameNet frames into more complex semantic structures
- Enriches semantic and pragmatic information
- Detailed inter-annotator agreement measures

Contents
- Texts drawn from the Open ANC
- Freely distributable portions of LU Corpus
- Subset of Wall Street Journal texts that have been heavily annotated by multiple projects
- Several genres
  - Written (travel guides, blog, fiction, letters, newspaper, non-fiction, technical, journal, government documents, court transcript)
  - Spoken (face-to-face, academic, telephone)
- Free of license restrictions, redistributable
- All MASC data and annotations freely downloadable from the ANC website

Annotation Process
- Smaller portions of the sub-corpus manually annotated for specific phenomena
  - Maintain representativeness
  - Include as many annotations of different types as possible
- Apply (semi-)automatic annotation techniques to determine the reliability of their results
- Study inter-annotator agreement on manually produced annotations
  - Determine benchmark of accuracy
  - Fine-tune annotator guidelines
- Consider whether accurate annotations for one phenomenon can improve performance of automatic annotation systems for another
  - E.g., validated WN sense tags and noun chunks may improve automatic semantic role labeling

Annotation Process (continued)
- Apply an iterative process to maximize performance of automatic taggers
  - Manual annotation
  - Retrain automatic annotation software
- Improved annotation software later applied to the entire OANC
  - Provide more accurate automatically produced annotation of the full corpus
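
The slides do not say which inter-annotator agreement measure the project uses. As a minimal sketch, assuming two annotators label the same items and that Cohen's kappa is an acceptable measure, agreement on sense tags could be computed as follows. The sense labels are made up for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical sense tags for four occurrences of one word.
annotator_a = ["sense1", "sense2", "sense1", "sense1"]
annotator_b = ["sense1", "sense2", "sense2", "sense1"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

A low kappa on a pilot batch is the kind of signal that would prompt a revision of the annotator guidelines before annotating more text.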

Representation
- ISO TC37 SC4 Linguistic Annotation Framework
  - Graph of feature structures (GrAF)
  - Isomorphic to other feature structure-based representations (e.g., UIMA CAS)
- Each annotation in a separate stand-off document linked to primary data or other annotations
- Merge annotations with ANC API
- Output in any of several formats
  - XML
  - Non-XML, for use with systems such as NLTK and concordancing tools
  - UIMA CAS
  - Input to GraphViz

ANC Pipeline
[Diagram] Texts in different formats -> ANC processing (automatically annotate) -> primary data, plus annotations as a graph of feature structures in stand-off XML documents -> ANC Tool (merge some or all annotations) -> input to UIMA, input to GraphViz, input to NLTK, others...
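
The stand-off idea can be illustrated with a small sketch. It is not the GrAF schema or the ANC API: the `Node` class, the layer names, and the sense and frame labels are all hypothetical. Each layer is kept separately and points into the primary text by character offsets. Merging collects the layers into one structure that can be queried or written out as GraphViz DOT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One annotation: a span of the primary text plus a feature structure."""
    node_id: str
    start: int
    end: int
    layer: str
    features: tuple  # sorted (name, value) pairs


def make_node(node_id, start, end, layer, **features):
    return Node(node_id, start, end, layer, tuple(sorted(features.items())))


def merge_layers(*layers):
    """Combine separately stored layers into one id-indexed collection."""
    merged = {}
    for layer in layers:
        for node in layer:
            if node.node_id in merged:
                raise ValueError(f"duplicate node id: {node.node_id}")
            merged[node.node_id] = node
    return merged


def annotations_at(merged, start, end):
    """All annotations, from any layer, anchored to exactly this span."""
    return [n for n in merged.values() if (n.start, n.end) == (start, end)]


def to_dot(merged, text):
    """Render the merged annotations as GraphViz DOT nodes (no edges)."""
    lines = ["digraph annotations {"]
    for n in sorted(merged.values(), key=lambda n: (n.start, n.end, n.node_id)):
        feats = ", ".join(f"{k}={v}" for k, v in n.features)
        span = text[n.start:n.end]
        lines.append(f'  "{n.node_id}" [label="{n.layer}: {span} [{feats}]"];')
    lines.append("}")
    return "\n".join(lines)


# The primary text is never modified; every layer refers to it by offsets.
text = "Ide visited Berkeley."
ne_layer = [make_node("ne1", 0, 3, "ne", type="person"),
            make_node("ne2", 12, 20, "ne", type="location")]
wn_layer = [make_node("wn1", 4, 11, "wordnet", sense="visit-sense-1")]
fn_layer = [make_node("fn1", 4, 11, "framenet", frame="Visiting")]

merged = merge_layers(ne_layer, wn_layer, fn_layer)
on_verb = annotations_at(merged, 4, 11)
print(sorted(n.layer for n in on_verb))
print(to_dot(merged, text))
```

Because the WordNet and FrameNet layers anchor to the same span, a query by offsets finds both, which is the kind of cross-layer linking the slides describe.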

Transduction
- Different annotation formats: PTB, PropBank, NomBank, PDTB, TimeBank
- Transduce to GrAF
- Merge

Alignment of Lexical Resources
- Concurrent NSF-funded project investigating how and to what extent WordNet and FrameNet can be aligned
- MASC annotations of FrameNet frames and frame elements and WordNet senses provide a ready-made testing ground
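
To make the transduce-then-merge step concrete, here is a minimal sketch under simplifying assumptions. The two input formats are toy stand-ins (an inline bracket notation and token/BIO-tag pairs), not the actual PTB, PropBank, NomBank, PDTB, or TimeBank formats. The target is a plain list of `(start, end, label)` stand-off spans rather than real GrAF. The point is only that differently encoded annotations become comparable once they share one offset-based representation.

```python
import re

# Toy inline notation: "[LABEL some text]".
BRACKET = re.compile(r"\[(\w+) ([^\]]+)\]")


def from_inline(marked):
    """Strip inline brackets; return (plain_text, [(start, end, label)])."""
    parts, spans, pos, last = [], [], 0, 0
    for m in BRACKET.finditer(marked):
        before = marked[last:m.start()]
        parts.append(before)
        pos += len(before)
        label, content = m.group(1), m.group(2)
        spans.append((pos, pos + len(content), label))
        parts.append(content)
        pos += len(content)
        last = m.end()
    parts.append(marked[last:])
    return "".join(parts), spans


def from_bio(pairs):
    """Convert (token, BIO tag) pairs; tokens are joined by single spaces."""
    tokens, spans, pos, current = [], [], 0, None
    for i, (token, tag) in enumerate(pairs):
        if i > 0:
            pos += 1  # the joining space
        start, end = pos, pos + len(token)
        if tag.startswith("B-"):
            if current:
                spans.append(tuple(current))
            current = [start, end, tag[2:]]
        elif tag.startswith("I-") and current and current[2] == tag[2:]:
            current[1] = end  # extend the open span
        else:
            # "O", or an "I-" tag with no matching open span, closes the span.
            if current:
                spans.append(tuple(current))
            current = None
        pos = end
        tokens.append(token)
    if current:
        spans.append(tuple(current))
    return " ".join(tokens), spans


inline_result = from_inline("[PER Ide] visited [LOC Berkeley]")
bio_result = from_bio([("Ide", "B-PER"), ("visited", "O"), ("Berkeley", "B-LOC")])
print(inline_result)
print(bio_result)
```

Once both sources are expressed as offsets into the same text, merging them reduces to combining span lists, whatever format each annotation started in.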

Goals
- Continually augment MASC with contributed annotations from the research community
  - Discourse structure, additional entities, events, opinions, etc.
- Distribution of effort and integration of currently independent resources such as the ANC, WordNet, and FrameNet will enable progress in resource development
  - Less cost
  - No duplication of effort
  - Greater degree of accuracy and usability
  - Harmonization
- MASC can serve as a model for community effort to develop the methods and resources required to further NLP research

MASC
- Will be the largest semantically annotated corpus of English in existence
- Should have a major impact on the speed with which similar resources can be reliably annotated
- WN and FN annotation of the MASC will immediately create a massive multilingual resource network
  - Both WN and FN are linked to corresponding resources in other languages
  - No existing resource approaches this scope
- Because it enables merging annotations at different linguistic levels, MASC will
  - facilitate a deeper investigation of interactions among linguistic phenomena
  - contribute to a better understanding of the workings of language at the semantic level

Recommendations and Conclusions
- Pursue automatic acquisition efforts and manual resource creation, annotation, and analysis in parallel
  - Automatic acquisition can get us to the ~80% ceiling
  - Manual effort can get us the other 20%
- Embrace the need to render the knowledge resources created by automatic acquisition in a form and format that can interoperate with annotations and other resources
  - The NLP community does not need yet another independent resource that is difficult or impossible to use with other resources

PS
- OANC available online
- First set of MASC data (~120K words) should be available by end of year, augmented regularly after that
- We encourage contributions of annotations (automatic or manual) of MASC and/or OANC data for any linguistic phenomenon, in any format
  - We will do the transduction to GrAF

ANC2Go: A Web Application for Customized Corpus Creation

ANC2Go: A Web Application for Customized Corpus Creation ANC2Go: A Web Application for Customized Corpus Creation Nancy Ide, Keith Suderman, Brian Simms Department of Computer Science, Vassar College Poughkeepsie, New York 12604 USA {ide, suderman, brsimms}@cs.vassar.edu

More information

MASC: The Manually Annotated Sub-Corpus of American English. Nancy Ide*, Collin Baker**, Christiane Fellbaum, Charles Fillmore**, Rebecca Passonneau

MASC: The Manually Annotated Sub-Corpus of American English. Nancy Ide*, Collin Baker**, Christiane Fellbaum, Charles Fillmore**, Rebecca Passonneau MASC: The Manually Annotated Sub-Corpus of American English Nancy Ide*, Collin Baker**, Christiane Fellbaum, Charles Fillmore**, Rebecca Passonneau *Vassar College Poughkeepsie, New York USA **International

More information

Background and Context for CLASP. Nancy Ide, Vassar College

Background and Context for CLASP. Nancy Ide, Vassar College Background and Context for CLASP Nancy Ide, Vassar College The Situation Standards efforts have been on-going for over 20 years Interest and activity mainly in Europe in 90 s and early 2000 s Text Encoding

More information

MASC: A Community Resource For and By the People

MASC: A Community Resource For and By the People MASC: A Community Resource For and By the People Nancy Ide Department of Computer Science Vassar College Poughkeepsie, NY, USA ide@cs.vassar.edu Christiane Fellbaum Princeton University Princeton, New

More information

DEVELOPING LINGUISTIC RESOURCES WITH THE ANC REPORT ON THE NSF-FUNDED WORKSHOP

DEVELOPING LINGUISTIC RESOURCES WITH THE ANC REPORT ON THE NSF-FUNDED WORKSHOP DEVELOPING LINGUISTIC RESOURCES WITH THE ANC REPORT ON THE NSF-FUNDED WORKSHOP Nany Ide, Vassar College Christiane Fellbaum, Princeton University 1 Introduction An NSF-funded workshop on Developing Linguistic

More information

An Open Linguistic Infrastructure for Annotated Corpora

An Open Linguistic Infrastructure for Annotated Corpora An Open Linguistic Infrastructure for Annotated Corpora Nancy Ide 1 Introduction Annotated corpora are a fundamental resource for research and development in the field of natural language processing (NLP).

More information

Final Project Discussion. Adam Meyers Montclair State University

Final Project Discussion. Adam Meyers Montclair State University Final Project Discussion Adam Meyers Montclair State University Summary Project Timeline Project Format Details/Examples for Different Project Types Linguistic Resource Projects: Annotation, Lexicons,...

More information

Importing MASC into the ANNIS linguistic database: A case study of mapping GrAF

Importing MASC into the ANNIS linguistic database: A case study of mapping GrAF Importing MASC into the ANNIS linguistic database: A case study of mapping GrAF Arne Neumann 1 Nancy Ide 2 Manfred Stede 1 1 EB Cognitive Science and SFB 632 University of Potsdam 2 Department of Computer

More information

Annotation Science From Theory to Practice and Use Introduction A bit of history

Annotation Science From Theory to Practice and Use Introduction A bit of history Annotation Science From Theory to Practice and Use Nancy Ide Department of Computer Science Vassar College Poughkeepsie, New York 12604 USA ide@cs.vassar.edu Introduction Linguistically-annotated corpora

More information

Generating FrameNets of various granularities: The FrameNet Transformer

Generating FrameNets of various granularities: The FrameNet Transformer Generating FrameNets of various granularities: The FrameNet Transformer Josef Ruppenhofer, Jonas Sunde, & Manfred Pinkal Saarland University LREC, May 2010 Ruppenhofer, Sunde, Pinkal (Saarland U.) Generating

More information

Importing MASC into the ANNIS linguistic database: A case study of mapping GrAF

Importing MASC into the ANNIS linguistic database: A case study of mapping GrAF Importing MASC into the ANNIS linguistic database: A case study of mapping GrAF Arne Neumann EB Cognitive Science and SFB 632 University of Potsdam neumana@uni-potsdam.de Nancy Ide Department of Computer

More information

A BNC-like corpus of American English

A BNC-like corpus of American English The American National Corpus Everything You Always Wanted To Know... And Weren t Afraid To Ask Nancy Ide Department of Computer Science Vassar College What is the? A BNC-like corpus of American English

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

Bridging the Gaps. Interoperability for Language Engineering Architectures Using GrAF. Noname manuscript No. (will be inserted by the editor)

Bridging the Gaps. Interoperability for Language Engineering Architectures Using GrAF. Noname manuscript No. (will be inserted by the editor) Noname manuscript No. (will be inserted by the editor) Bridging the Gaps Interoperability for Language Engineering Architectures Using GrAF Nancy Ide Keith Suderman Received: date / Accepted: date Abstract

More information

The American National Corpus First Release

The American National Corpus First Release The American National Corpus First Release Nancy Ide and Keith Suderman Department of Computer Science, Vassar College, Poughkeepsie, NY 12604-0520 USA ide@cs.vassar.edu, suderman@cs.vassar.edu Abstract

More information

Data for linguistics ALEXIS DIMITRIADIS. Contents First Last Prev Next Back Close Quit

Data for linguistics ALEXIS DIMITRIADIS. Contents First Last Prev Next Back Close Quit Data for linguistics ALEXIS DIMITRIADIS Text, corpora, and data in the wild 1. Where does language data come from? The usual: Introspection, questionnaires, etc. Corpora, suited to the domain of study:

More information

Corpus Linguistics: corpus annotation

Corpus Linguistics: corpus annotation Corpus Linguistics: corpus annotation Karën Fort karen.fort@inist.fr November 30, 2010 Introduction Methodology Annotation Issues Annotation Formats From Formats to Schemes Sources Most of this course

More information

structure of the presentation Frame Semantics knowledge-representation in larger-scale structures the concept of frame

structure of the presentation Frame Semantics knowledge-representation in larger-scale structures the concept of frame structure of the presentation Frame Semantics semantic characterisation of situations or states of affairs 1. introduction (partially taken from a presentation of Markus Egg): i. what is a frame supposed

More information

Making Sense Out of the Web

Making Sense Out of the Web Making Sense Out of the Web Rada Mihalcea University of North Texas Department of Computer Science rada@cs.unt.edu Abstract. In the past few years, we have witnessed a tremendous growth of the World Wide

More information

Bridging the Gaps: Interoperability for GrAF, GATE, and UIMA

Bridging the Gaps: Interoperability for GrAF, GATE, and UIMA Bridging the Gaps: Interoperability for GrAF, GATE, and UIMA Nancy Ide Department of Computer Science Vassar College Poughkeepsie, New York USA ide@cs.vassar.edu Keith Suderman Department of Computer Science

More information

STS Infrastructural considerations. Christian Chiarcos

STS Infrastructural considerations. Christian Chiarcos STS Infrastructural considerations Christian Chiarcos chiarcos@uni-potsdam.de Infrastructure Requirements Candidates standoff-based architecture (Stede et al. 2006, 2010) UiMA (Ferrucci and Lally 2004)

More information

Sustainability of Text-Technological Resources

Sustainability of Text-Technological Resources Sustainability of Text-Technological Resources Maik Stührenberg, Michael Beißwenger, Kai-Uwe Kühnberger, Harald Lüngen, Alexander Mehler, Dieter Metzing, Uwe Mönnich Research Group Text-Technological Overview

More information

Frame Semantic Structure Extraction

Frame Semantic Structure Extraction Frame Semantic Structure Extraction Organizing team: Collin Baker, Michael Ellsworth (International Computer Science Institute, Berkeley), Katrin Erk(U Texas, Austin) October 4, 2006 1 Description of task

More information

English Understanding: From Annotations to AMRs

English Understanding: From Annotations to AMRs English Understanding: From Annotations to AMRs Nathan Schneider August 28, 2012 :: ISI NLP Group :: Summer Internship Project Presentation 1 Current state of the art: syntax-based MT Hierarchical/syntactic

More information

RPI INSIDE DEEPQA INTRODUCTION QUESTION ANALYSIS 11/26/2013. Watson is. IBM Watson. Inside Watson RPI WATSON RPI WATSON ??? ??? ???

RPI INSIDE DEEPQA INTRODUCTION QUESTION ANALYSIS 11/26/2013. Watson is. IBM Watson. Inside Watson RPI WATSON RPI WATSON ??? ??? ??? @ INSIDE DEEPQA Managing complex unstructured data with UIMA Simon Ellis INTRODUCTION 22 nd November, 2013 WAT SON TECHNOLOGIES AND OPEN ARCHIT ECT URE QUEST ION ANSWERING PROFESSOR JIM HENDLER S IMON

More information

Corpus Linguistics for NLP APLN550. Adam Meyers Montclair State University 9/22/2014 and 9/29/2014

Corpus Linguistics for NLP APLN550. Adam Meyers Montclair State University 9/22/2014 and 9/29/2014 Corpus Linguistics for NLP APLN550 Adam Meyers Montclair State University 9/22/ and 9/29/ Text Corpora in NLP Corpus Selection Corpus Annotation: Purpose Representation Issues Linguistic Methods Measuring

More information

(Some) Standards in the Humanities. Sebastian Drude CLARIN ERIC RDA 4 th Plenary, Amsterdam September 2014

(Some) Standards in the Humanities. Sebastian Drude CLARIN ERIC RDA 4 th Plenary, Amsterdam September 2014 (Some) Standards in the Humanities Sebastian Drude CLARIN ERIC RDA 4 th Plenary, Amsterdam September 2014 1. Introduction Overview 2. Written text: the Text Encoding Initiative (TEI) 3. Multimodal: ELAN

More information

Data and Information Integration: Information Extraction

Data and Information Integration: Information Extraction International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Data and Information Integration: Information Extraction Varnica Verma 1 1 (Department of Computer Science Engineering, Guru Nanak

More information

An UIMA based Tool Suite for Semantic Text Processing

An UIMA based Tool Suite for Semantic Text Processing An UIMA based Tool Suite for Semantic Text Processing Katrin Tomanek, Ekaterina Buyko, Udo Hahn Jena University Language & Information Engineering Lab StemNet Knowledge Management for Immunology in life

More information

Towards a roadmap for standardization in language technology

Towards a roadmap for standardization in language technology Towards a roadmap for standardization in language technology Laurent Romary & Nancy Ide Loria-INRIA Vassar College Overview General background on standardization Available standards On-going activities

More information

XML Support for Annotated Language Resources

XML Support for Annotated Language Resources XML Support for Annotated Language Resources Nancy Ide Department of Computer Science Vassar College Poughkeepsie, New York USA ide@cs.vassar.edu Laurent Romary Equipe Langue et Dialogue LORIA/CNRS Vandoeuvre-lès-Nancy,

More information

UIMA-based Annotation Type System for a Text Mining Architecture

UIMA-based Annotation Type System for a Text Mining Architecture UIMA-based Annotation Type System for a Text Mining Architecture Udo Hahn, Ekaterina Buyko, Katrin Tomanek, Scott Piao, Yoshimasa Tsuruoka, John McNaught, Sophia Ananiadou Jena University Language and

More information

Question Answering Using XML-Tagged Documents

Question Answering Using XML-Tagged Documents Question Answering Using XML-Tagged Documents Ken Litkowski ken@clres.com http://www.clres.com http://www.clres.com/trec11/index.html XML QA System P Full text processing of TREC top 20 documents Sentence

More information

SemEval-2013 Task 13: Word Sense Induction for Graded and Non-Graded Senses

SemEval-2013 Task 13: Word Sense Induction for Graded and Non-Graded Senses SemEval-2013 Task 13: Word Sense Induction for Graded and Non-Graded Senses David Jurgens Dipartimento di Informatica Sapienza Universita di Roma jurgens@di.uniroma1.it Ioannis Klapaftis Search Technology

More information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University sekine@cs.nyu.edu Kapil Dalwani Computer Science Department

More information

Discriminative Training with Perceptron Algorithm for POS Tagging Task

Discriminative Training with Perceptron Algorithm for POS Tagging Task Discriminative Training with Perceptron Algorithm for POS Tagging Task Mahsa Yarmohammadi Center for Spoken Language Understanding Oregon Health & Science University Portland, Oregon yarmoham@ohsu.edu

More information

A Semantic Role Repository Linking FrameNet and WordNet

A Semantic Role Repository Linking FrameNet and WordNet A Semantic Role Repository Linking FrameNet and WordNet Volha Bryl, Irina Sergienya, Sara Tonelli, Claudio Giuliano {bryl,sergienya,satonelli,giuliano}@fbk.eu Fondazione Bruno Kessler, Trento, Italy Abstract

More information

Introduction to Text Mining. Hongning Wang

Introduction to Text Mining. Hongning Wang Introduction to Text Mining Hongning Wang CS@UVa Who Am I? Hongning Wang Assistant professor in CS@UVa since August 2014 Research areas Information retrieval Data mining Machine learning CS@UVa CS6501:

More information

Using UIMA to Structure an Open Platform for Textual Entailment. Tae-Gil Noh, Sebastian Padó Dept. of Computational Linguistics Heidelberg University

Using UIMA to Structure an Open Platform for Textual Entailment. Tae-Gil Noh, Sebastian Padó Dept. of Computational Linguistics Heidelberg University Using UIMA to Structure an Open Platform for Textual Entailment Tae-Gil Noh, Sebastian Padó Dept. of Computational Linguistics Heidelberg University The paper is about About EXCITEMENT Open Platform a

More information

Migrating LINA Laboratory to Apache UIMA

Migrating LINA Laboratory to Apache UIMA Migrating LINA Laboratory to Apache UIMA Stegos Afantenos et Matthieu Vernier Équipe TALN - Laboratoire Informatique Nantes Atlantique Vendredi 10 Juillet 2009 Afantenos, Vernier (TALN - LINA) UIMA @ LINA

More information

Error annotation in adjective noun (AN) combinations

Error annotation in adjective noun (AN) combinations Error annotation in adjective noun (AN) combinations This document describes the annotation scheme devised for annotating errors in AN combinations and explains how the inter-annotator agreement has been

More information

Annotating Spatio-Temporal Information in Documents

Annotating Spatio-Temporal Information in Documents Annotating Spatio-Temporal Information in Documents Jannik Strötgen University of Heidelberg Institute of Computer Science Database Systems Research Group http://dbs.ifi.uni-heidelberg.de stroetgen@uni-hd.de

More information

A platform for collaborative semantic annotation

A platform for collaborative semantic annotation A platform for collaborative semantic annotation Valerio Basile and Johan Bos and Kilian Evang and Noortje Venhuizen {v.basile,johan.bos,k.evang,n.j.venhuizen}@rug.nl Center for Language and Cognition

More information

A Hybrid Unsupervised Web Data Extraction using Trinity and NLP

A Hybrid Unsupervised Web Data Extraction using Trinity and NLP IJIRST International Journal for Innovative Research in Science & Technology Volume 2 Issue 02 July 2015 ISSN (online): 2349-6010 A Hybrid Unsupervised Web Data Extraction using Trinity and NLP Anju R

More information

Natural Language Processing Pipelines to Annotate BioC Collections with an Application to the NCBI Disease Corpus

Natural Language Processing Pipelines to Annotate BioC Collections with an Application to the NCBI Disease Corpus Natural Language Processing Pipelines to Annotate BioC Collections with an Application to the NCBI Disease Corpus Donald C. Comeau *, Haibin Liu, Rezarta Islamaj Doğan and W. John Wilbur National Center

More information

Corpus methods for sociolinguistics. Emily M. Bender NWAV 31 - October 10, 2002

Corpus methods for sociolinguistics. Emily M. Bender NWAV 31 - October 10, 2002 Corpus methods for sociolinguistics Emily M. Bender bender@csli.stanford.edu NWAV 31 - October 10, 2002 Overview Introduction Corpora of interest Software for accessing and analyzing corpora (demo) Basic

More information

Language resource management Semantic annotation framework (SemAF) Part 8: Semantic relations in discourse, core annotation schema (DR-core)

Language resource management Semantic annotation framework (SemAF) Part 8: Semantic relations in discourse, core annotation schema (DR-core) INTERNATIONAL STANDARD ISO 24617-8 First edition 2016-12-15 Language resource management Semantic annotation framework (SemAF) Part 8: Semantic relations in discourse, core annotation schema (DR-core)

More information

A Multilingual Social Media Linguistic Corpus

A Multilingual Social Media Linguistic Corpus A Multilingual Social Media Linguistic Corpus Luis Rei 1,2 Dunja Mladenić 1,2 Simon Krek 1 1 Artificial Intelligence Laboratory Jožef Stefan Institute 2 Jožef Stefan International Postgraduate School 4th

More information

A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet

A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet Joerg-Uwe Kietz, Alexander Maedche, Raphael Volz Swisslife Information Systems Research Lab, Zuerich, Switzerland fkietz, volzg@swisslife.ch

More information

LIDER Survey. Overview. Number of participants: 24. Participant profile (organisation type, industry sector) Relevant use-cases

LIDER Survey. Overview. Number of participants: 24. Participant profile (organisation type, industry sector) Relevant use-cases LIDER Survey Overview Participant profile (organisation type, industry sector) Relevant use-cases Discovering and extracting information Understanding opinion Content and data (Data Management) Monitoring

More information

Experiences with UIMA in NLP teaching and research. Manuela Kunze, Dietmar Rösner

Experiences with UIMA in NLP teaching and research. Manuela Kunze, Dietmar Rösner Experiences with UIMA in NLP teaching and research Manuela Kunze, Dietmar Rösner University of Magdeburg C Knowledge Based Systems and Document Processing Overview What is UIMA? First Experiments NLP Teaching

More information

Project Name. The Eclipse Integrated Computational Environment. Jay Jay Billings, ORNL Parent Project. None selected yet.

Project Name. The Eclipse Integrated Computational Environment. Jay Jay Billings, ORNL Parent Project. None selected yet. Project Name The Eclipse Integrated Computational Environment Jay Jay Billings, ORNL 20140219 Parent Project None selected yet. Background The science and engineering community relies heavily on modeling

More information

Best practices in the design, creation and dissemination of speech corpora at The Language Archive

Best practices in the design, creation and dissemination of speech corpora at The Language Archive LREC Workshop 18 2012-05-21 Istanbul Best practices in the design, creation and dissemination of speech corpora at The Language Archive Sebastian Drude, Daan Broeder, Peter Wittenburg, Han Sloetjes The

More information

A Linguistic Approach for Semantic Web Service Discovery

A Linguistic Approach for Semantic Web Service Discovery A Linguistic Approach for Semantic Web Service Discovery Jordy Sangers 307370js jordysangers@hotmail.com Bachelor Thesis Economics and Informatics Erasmus School of Economics Erasmus University Rotterdam

More information

Statistical parsing. Fei Xia Feb 27, 2009 CSE 590A

Statistical parsing. Fei Xia Feb 27, 2009 CSE 590A Statistical parsing Fei Xia Feb 27, 2009 CSE 590A Statistical parsing History-based models (1995-2000) Recent development (2000-present): Supervised learning: reranking and label splitting Semi-supervised

More information

Annotation by category - ELAN and ISO DCR

Annotation by category - ELAN and ISO DCR Annotation by category - ELAN and ISO DCR Han Sloetjes, Peter Wittenburg Max Planck Institute for Psycholinguistics P.O. Box 310, 6500 AH Nijmegen, The Netherlands E-mail: Han.Sloetjes@mpi.nl, Peter.Wittenburg@mpi.nl

More information

Implementing a Variety of Linguistic Annotations

Implementing a Variety of Linguistic Annotations Implementing a Variety of Linguistic Annotations through a Common Web-Service Interface Adam Funk, Ian Roberts, Wim Peters University of Sheffield 18 May 2010 Adam Funk, Ian Roberts, Wim Peters Implementing

More information

Customisable Curation Workflows in Argo

Customisable Curation Workflows in Argo Customisable Curation Workflows in Argo Rafal Rak*, Riza Batista-Navarro, Andrew Rowley, Jacob Carter and Sophia Ananiadou National Centre for Text Mining, University of Manchester, UK *Corresponding author:

More information

Semantic Web and Natural Language Processing

Semantic Web and Natural Language Processing Semantic Web and Natural Language Processing Wiltrud Kessler Institut für Maschinelle Sprachverarbeitung Universität Stuttgart Semantic Web Winter 2014/2015 This work is licensed under a Creative Commons

More information

BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network

BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network Roberto Navigli, Simone Paolo Ponzetto What is BabelNet a very large, wide-coverage multilingual

More information

Design and Realization of the EXCITEMENT Open Platform for Textual Entailment. Günter Neumann, DFKI Sebastian Pado, Universität Stuttgart

Design and Realization of the EXCITEMENT Open Platform for Textual Entailment. Günter Neumann, DFKI Sebastian Pado, Universität Stuttgart Design and Realization of the EXCITEMENT Open Platform for Textual Entailment Günter Neumann, DFKI Sebastian Pado, Universität Stuttgart Textual Entailment Textual Entailment (TE) A Text (T) entails a

More information

Meaning Banking and Beyond

Meaning Banking and Beyond Meaning Banking and Beyond Valerio Basile Wimmics, Inria November 18, 2015 Semantics is a well-kept secret in texts, accessible only to humans. Anonymous I BEG TO DIFFER Surface Meaning Step by step analysis

More information
