Deliverable D1.4 Report Describing Integration Strategies and Experiments

DEEPTHOUGHT
Hybrid Deep and Shallow Methods for Knowledge-Intensive Information Extraction

Deliverable D1.4
Report Describing Integration Strategies and Experiments

The Consortium
October 2004

Project ref. no.:
Project acronym:
Project full title: Hybrid Deep and Shallow Methods for Knowledge-Intensive Information Extraction
Security (distribution level): Public
Contractual date of delivery:
Actual date of delivery:
Deliverable number: D1.4
Deliverable name: Report Describing Integration Strategies and Experiments
Type: Report
Status & version: Final
Number of pages: 9
WP contributing to the deliverable: WP1b
WP / Task responsible: WP1b
Other contributors:
Author(s): John Carroll, Alex Fang, Melanie Siegel
EC Project Officer: Evangelia Markidou
Keywords: Hybrid NLP, Named-Entity Recognition, architecture
Abstract: The implemented strategies for hybrid NLP are described and examples are given using screenshots.

Content

1 Integration Strategies in the Heart of Gold
2 Mobile Phone Name Recognition for English
  2.1 Input and Output Specification
  2.2 Construction of an annotated sub-corpus
  2.3 The recognition program
  2.4 A quantitative evaluation of the mobile phone name recogniser
  2.5 Error Analysis
3 References

1 Integration Strategies in the Heart of Gold

Implemented strategies for hybrid NLP in the project include:

• The analysis results of NLP tools at lower processing levels can be used by components at higher levels.
  o For example, the deep linguistic analysis module PET uses default lexicon entries for the named entities that the named-entity recogniser Sprout delivers.
  o Likewise, PET uses default lexicon entries for the part-of-speech tags that the POS tagger TnT delivers.
• Deliver the deepest result found. If a module of the required depth cannot deliver a result, deliver the next deepest result. This is the approach that the autoresponse application mainly follows.
• Deliver partial results whenever a complete analysis is not available. Partial results are taken from the deepest module that delivers results.
• Combine modules and grammars for different languages. Each language has its own configuration of valid modules and grammars.
• The different modules use a compatible output formalism, RMRS.
  o In the case of shallower modules, this robust semantic structure allows for underspecification of, e.g., argument structure.
• Refine the data provided by shallower modules through deep parsing. This is a strategy that the Business Intelligence and Autoresponse applications use. Chunk processing and named-entity recognition are used to find relevant information sources; deep processing is then applied to the information snippets found, either to verify or to filter the extracted information.
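The "deliver the deepest result found" strategy can be illustrated with a minimal sketch. It is not the Heart of Gold implementation: the three module functions, their depth numbers and their failure conditions are hypothetical stand-ins chosen only to show the fallback order.

```python
from typing import Callable, List, Optional, Tuple

Analysis = Optional[str]
# (depth, name, function); all three modules below are hypothetical stand-ins.
Module = Tuple[int, str, Callable[[str], Analysis]]


def pos_tagger(text: str) -> Analysis:
    # Shallowest module: assumed to always deliver a result.
    return "pos:" + text


def ne_recogniser(text: str) -> Analysis:
    # Toy condition: delivers a result only if some word is capitalised.
    has_cap = any(word[0].isupper() for word in text.split())
    return "ne:" + text if has_cap else None


def deep_parser(text: str) -> Analysis:
    # Toy condition: the deep module "fails" on inputs longer than five words.
    return "rmrs:" + text if len(text.split()) <= 5 else None


MODULES: List[Module] = [
    (1, "pos_tagger", pos_tagger),
    (2, "ne_recogniser", ne_recogniser),
    (3, "deep_parser", deep_parser),
]


def deepest_result(
    text: str, modules: List[Module], required_depth: int
) -> Optional[Tuple[int, str, str]]:
    """Try modules from the required depth downwards; return the first result."""
    candidates = sorted(
        (m for m in modules if m[0] <= required_depth),
        key=lambda m: m[0],
        reverse=True,
    )
    for depth, name, run in candidates:
        result = run(text)
        if result is not None:
            return depth, name, result
    return None
```

Calling `deepest_result(text, MODULES, 3)` on a short input returns the deep parser's output; on a long input it falls back to the next deepest module that succeeds.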

2 Mobile Phone Name Recognition for English

We describe below the construction and evaluation of a module for named entity recognition of mobile phone names. The module was integrated into the RASP English shallow analysis system, which in turn forms part of the Heart of Gold. On a manually annotated test set, the module achieved a recognition F-score of 81.5%.

2.1 Input and Output Specification

The input to the mobile phone name recognition module is a sequence of sentences in English that have already been marked up in XML style for word boundaries, with part-of-speech tags automatically assigned by RASP (Briscoe and Carroll 2002). Since this happens before the morphological analyser in the RASP pipeline, the tokens have not been lemmatized. For example, given the sentence "I am thinking of upgrading to the Sony Ericsson T68is from a Nokia 8260", the input to the module is:

    ^ ^
    <w s='2' e='2'>i</w> PPIS1
    <w s='4' e='5'>am</w> VBM
    <w s='7' e='14'>thinking</w> VVG
    <w s='16' e='17'>of</w> IO
    <w s='19' e='27'>upgrading</w> NN1
    <w s='29' e='30'>to</w> II
    <w s='32' e='34'>the</w> AT
    <w s='36' e='39'>sony</w> NP1
    <w s='41' e='48'>ericsson</w> NP1
    <w s='50' e='54'>t68is</w> NN1
    <w s='56' e='59'>from</w> II
    <w s='61' e='61'>a</w> AT1
    <w s='63' e='67'>nokia</w> NP1
    <w s='69' e='72'>8260</w> MC
    ^ ^

The task of the module is to mark up the mobile phone named entities in the input, namely Sony Ericsson T68is and Nokia 8260 in this example:

    ^ ^
    <w s='2' e='2'>i</w> PPIS1
    <w s='4' e='5'>am</w> VBM
    <w s='7' e='14'>thinking</w> VVG
    <w s='16' e='17'>of</w> IO
    <w s='19' e='27'>upgrading</w> NN1
    <w s='29' e='30'>to</w> II
    <w s='32' e='34'>the</w> AT
    <w netype='phone'> <w s='36' e='39'>sony</w> <w s='41' e='48'>ericsson</w> <w s='50' e='54'>t68is</w> </w> NP
    <w s='56' e='59'>from</w> II
    <w s='61' e='61'>a</w> AT1
    <w netype='phone'> <w s='63' e='67'>nokia</w> <w s='69' e='72'>8260</w> </w> NP
    ^ ^

where Sony Ericsson T68is and Nokia 8260 are marked up as named entities of type mobile phone (i.e. netype='phone'). Each is then treated as a single unit tagged as NP, namely, a proper name. The analysis based on this output from the module will be taken further down the RASP pipeline and yield the following RMRS representation:

[RMRS representation: figure not preserved in this transcription]
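The markup step of Section 2.1 (wrapping the tokens of a recognised entity in a `netype='phone'` element and giving the whole unit the tag NP) can be sketched as follows. This is an illustration in Python rather than the module's own code; the function name `mark_entities` and the representation of entities as token-index spans are assumptions made for the example.

```python
from typing import List, Tuple


def mark_entities(
    tokens: List[Tuple[str, str]], spans: List[Tuple[int, int]]
) -> List[str]:
    """Render tagged tokens, wrapping each entity span in a phone element.

    tokens: (w_element, pos_tag) pairs in sentence order.
    spans:  non-overlapping (start, end) token indices, end exclusive.
    """
    span_end = {start: end for start, end in spans}
    lines = []
    i = 0
    while i < len(tokens):
        if i in span_end:
            end = span_end[i]
            inner = " ".join(w for w, _ in tokens[i:end])
            # The individual tags are dropped; the unit as a whole is tagged NP.
            lines.append(f"<w netype='phone'> {inner} </w> NP")
            i = end
        else:
            w, tag = tokens[i]
            lines.append(f"{w} {tag}")
            i += 1
    return lines


tokens = [
    ("<w s='32' e='34'>the</w>", "AT"),
    ("<w s='36' e='39'>sony</w>", "NP1"),
    ("<w s='41' e='48'>ericsson</w>", "NP1"),
    ("<w s='50' e='54'>t68is</w>", "NN1"),
    ("<w s='56' e='59'>from</w>", "II"),
]
marked = mark_entities(tokens, [(1, 4)])
```

With the span `(1, 4)`, the three tokens sony, ericsson and t68is are emitted as one line in the same shape as the output example above.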

2.2 Construction of an annotated sub-corpus

For work described in Workpackage 2B, a 4,000,000-word corpus of Internet discussions on mobile phones was created for the domain-specific extraction of a verb subcategorisation lexicon (Carroll and Fang 2004). From this corpus, we randomly selected two sets of 200 texts each. Each text was then manually annotated such that each instance of a mobile phone name, a model number, or any combination of the two was marked up as an entity (<mobile> and </mobile>). Here is an example:

    I have for sale the following ORIGINAL <mobile> Nokia </mobile> accessories that will fit any of the <mobile> Nokia 6100 </mobile> / <mobile> 5100 </mobile> series phones, including but not limited to <mobile> 6160 </mobile>, <mobile> 6190 </mobile>, <mobile> 6188 </mobile>, <mobile> 6185 </mobile>, <mobile> 6162 </mobile>, <mobile> 6161 </mobile>, <mobile> 6185i </mobile>, <mobile> 5160 </mobile>, <mobile> 5190 </mobile>, etc.

The two sets are summarised in Table 1:

[Table 1: A summary of the annotated corpus. Columns: Texts, Sentences, Words, Entities; rows: Set 1, Set 2, Total. Values not preserved in this transcription.]

2.3 The recognition program

The automatic recogniser was implemented in C. The algorithm was designed based on the observation that the distribution of mobile phone names in our corpus is relatively sparse. There is insufficient data to train a purely statistical recogniser (e.g. a Maximum Entropy Model); it may however be possible to train a combined symbolic/statistical model (incorporating information, for example, on manufacturer names).

A set of mobile phone manufacturer names, such as Nokia and Ericsson, was manually drawn up. The remainder of the mobile phone corpus that had not been annotated (ca 2,800,000 tokens) was then used to construct a list of all the alphanumeric strings that contain at least 1 digit and that immediately follow one of these names. This process resulted in two entity sets:

• a list of mobile phone names
• a list of model numbers with their associated mobile phone names

The automatic recogniser marks the following as an entity:

• every occurrence of the mobile phone names
• every occurrence of the model numbers, given the following conditions:
  o they are longer than 3 characters
  o they occurred more than once in the training corpus
  o they occurred fewer than 2000 times in the training corpus [1]

[1] Numbers occurring more than 2000 times are interpreted as genuine "free" cardinals that are unlikely to be used in reference to a mobile phone.
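The list construction and the filtering conditions of Section 2.3 can be sketched as follows. The original recogniser was written in C; this Python sketch is only an illustration under several assumptions: manufacturer names are single lowercase tokens, the two-name set is an illustrative subset, frequencies are counted over all tokens of the training corpus, and the upper threshold of 2000 is taken from the footnote on free cardinals.

```python
from collections import Counter
from typing import List, Set, Tuple

MANUFACTURERS = {"nokia", "ericsson"}  # illustrative subset, not the full list
MAX_COUNT = 2000  # assumed upper frequency threshold (from the footnote)


def build_model_list(training_tokens: List[str]) -> Set[str]:
    """Collect model numbers that follow a manufacturer name, then filter them."""
    freq = Counter(training_tokens)
    candidates = {
        cur
        for prev, cur in zip(training_tokens, training_tokens[1:])
        if prev in MANUFACTURERS
        and cur.isalnum()
        and any(ch.isdigit() for ch in cur)
    }
    # Keep strings longer than 3 characters seen more than once but under 2000 times.
    return {m for m in candidates if len(m) > 3 and 1 < freq[m] < MAX_COUNT}


def recognise(tokens: List[str], models: Set[str]) -> List[Tuple[int, int]]:
    """Return (start, end) spans of adjacent name/model tokens, end exclusive."""
    spans = []
    start = None
    for i, tok in enumerate(tokens):
        hit = tok in MANUFACTURERS or tok in models
        if hit and start is None:
            start = i
        elif not hit and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tokens)))
    return spans


training = (
    "i like my nokia 8260 and the nokia 8260 is small "
    "but the nokia 232 cable has 2 ends"
).split()
models = build_model_list(training)
spans = recognise("upgrading from a nokia 8260 to ericsson".split(), models)
```

In this toy corpus, 8260 is kept because it follows a manufacturer name and occurs twice, while 232 is rejected by the length condition. Merging adjacent hits into one span mirrors the annotation scheme, in which a name, a model number, or any combination of the two counts as a single entity.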

2.4 A quantitative evaluation of the mobile phone name recogniser

For the quantitative evaluation of the mobile phone name recogniser's performance, the first annotated set was used for development and the second set was kept for testing. Each set was sub-divided into 4 sub-sets containing the same number of word tokens, in order to reveal any variation in performance. The initial run of the recogniser on the development set produced the following results:

[Table 2: Performance before tuning on the development set. Rows: Precision, Recall, F-Score, reported per sub-set and in total. Values not preserved in this transcription.]

The F-Score for the development set was just under 80%. Variations across the four sub-sets can be observed, with Set 3 showing the best F-Score of 83.6%. The output was manually inspected and changes were made to the list of mobile phone names and model numbers. Subsequent performance on the development set shows an F-Score of 82.1%, an increase of nearly 3 percentage points over the previous 79.4%:

[Table 3: Performance after tuning on the development set. Rows: Precision, Recall, F-Score, reported per sub-set and in total. Values not preserved in this transcription.]

When tested on the test set, the recogniser achieved an overall F-Score of 81.5%, with a precision of 81% and a recall of 81.9%:

[Table 4: Performance on the test set. Rows: Precision, Recall, F-Score, reported per sub-set and in total. Values not preserved in this transcription.]

As can be observed from the table above, the best performance was an F-Score of 91.1% and the worst was 73.2%. This considerable variation within the test set suggests that the performance of the system varies with different types of input.
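For reference, the scores used in Section 2.4 can be computed as in the sketch below. The report does not say how a predicted entity is matched against the annotation; exact span matching is assumed here purely for illustration, and the function names are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int]


def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Precision, recall and F-score under exact span matching (an assumption)."""
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Reported test-set figures: precision 81%, recall 81.9%.
reported_f = f_score(0.81, 0.819)

# Toy example: one of two predicted spans matches one of two gold spans.
p, r, f = evaluate({(3, 5), (6, 7)}, {(3, 5), (8, 9)})
```

Applied to the rounded precision and recall reported for the test set, the formula gives an F-score of about 81.4%, which agrees with the reported 81.5% to within rounding of the inputs.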

2.5 Error Analysis

There are two major sources of errors. First of all, there is frequent ambiguity between phone names and company names, as in the following example:

    I was wondering if anyone has any information on how the Ericsson Bluetooth kits calculates the BER packets when the BER test is run.

where Ericsson can be analysed as referring to the company instead of the phone. Arguably, this is a genuinely ambiguous case. The second major ambiguity is between ordinary numbers and model numbers:

    There are 2 connectors on the cable, 1 RS 232 and 1 cigarette lighter.

Since 232 has been observed before as co-occurring with mobile phone names, the module assumes that in the current context it refers to a mobile phone product and therefore erroneously marks it as a phone name.

3 References

Briscoe, E. and J. Carroll (2002). Robust accurate statistical annotation of general text. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, Las Palmas, Gran Canaria.

Carroll, J. and A.C. Fang (2004). The Automatic Acquisition of Verb Subcategorisations and their Impact on an HPSG Parser. In Proceedings of the 1st International Joint Conference on Natural Language Processing, March 2004, Hainan, China.

Uszkoreit, H., U. Callmeier, A. Eisele, U. Schäfer, M. Siegel and J. Uszkoreit (2004). Hybrid Robust Deep and Shallow Semantic Processing for Creativity Support in Document Production. In Proceedings of KONVENS 2004, Vienna, Austria.

Callmeier, U., A. Eisele, U. Schäfer and M. Siegel (2004). The Core Architecture Framework. In Proceedings of LREC 04, Lisbon, Portugal.


More information

Transforming Requirements into MDA from User Stories to CIM

Transforming Requirements into MDA from User Stories to CIM , pp.15-22 http://dx.doi.org/10.14257/ijseia.2017.11.8.03 Transing Requirements into MDA from User Stories to CIM Meryem Elallaoui 1, Khalid Nafil 2 and Raja Touahni 1 1 Faculty of Sciences, Ibn Tofail

More information

Assignment #1: Named Entity Recognition

Assignment #1: Named Entity Recognition Assignment #1: Named Entity Recognition Dr. Zornitsa Kozareva USC Information Sciences Institute Spring 2013 Task Description: You will be given three data sets total. First you will receive the train

More information

System Combination Using Joint, Binarised Feature Vectors

System Combination Using Joint, Binarised Feature Vectors System Combination Using Joint, Binarised Feature Vectors Christian F EDERMAN N 1 (1) DFKI GmbH, Language Technology Lab, Stuhlsatzenhausweg 3, D-6613 Saarbrücken, GERMANY cfedermann@dfki.de Abstract We

More information

Exam Marco Kuhlmann. This exam consists of three parts:

Exam Marco Kuhlmann. This exam consists of three parts: TDDE09, 729A27 Natural Language Processing (2017) Exam 2017-03-13 Marco Kuhlmann This exam consists of three parts: 1. Part A consists of 5 items, each worth 3 points. These items test your understanding

More information

Transition-Based Dependency Parsing with Stack Long Short-Term Memory

Transition-Based Dependency Parsing with Stack Long Short-Term Memory Transition-Based Dependency Parsing with Stack Long Short-Term Memory Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, Noah A. Smith Association for Computational Linguistics (ACL), 2015 Presented

More information

Lecture 14: Annotation

Lecture 14: Annotation Lecture 14: Annotation Nathan Schneider (with material from Henry Thompson, Alex Lascarides) ENLP 23 October 2016 1/14 Annotation Why gold 6= perfect Quality Control 2/14 Factors in Annotation Suppose

More information

Mining Aspects in Requirements

Mining Aspects in Requirements Mining Aspects in Requirements Américo Sampaio, Neil Loughran, Awais Rashid and Paul Rayson Computing Department, Lancaster University, Lancaster, UK {a.sampaio, loughran, marash, paul}@comp.lancs.ac.uk

More information

Maximum Entropy based Natural Language Interface for Relational Database

Maximum Entropy based Natural Language Interface for Relational Database International Journal of Engineering Research and Technology. ISSN 0974-3154 Volume 7, Number 1 (2014), pp. 69-77 International Research Publication House http://www.irphouse.com Maximum Entropy based

More information

A Hybrid Unsupervised Web Data Extraction using Trinity and NLP

A Hybrid Unsupervised Web Data Extraction using Trinity and NLP IJIRST International Journal for Innovative Research in Science & Technology Volume 2 Issue 02 July 2015 ISSN (online): 2349-6010 A Hybrid Unsupervised Web Data Extraction using Trinity and NLP Anju R

More information

LAB 3: Text processing + Apache OpenNLP

LAB 3: Text processing + Apache OpenNLP LAB 3: Text processing + Apache OpenNLP 1. Motivation: The text that was derived (e.g., crawling + using Apache Tika) must be processed before being used in an information retrieval system. Text processing

More information

* Overview. Ontology-Guided Information Extraction from Pathology Reports The SWPatho Project David Schlangen Universität Potsdam

* Overview. Ontology-Guided Information Extraction from Pathology Reports The SWPatho Project David Schlangen Universität Potsdam Overview Background of project The task The system Digression: gently machine aided ontology construction Evaluation Future Work -Guided Information Extraction from Pathology Reports The SWPatho Project

More information

CACAO PROJECT AT THE 2009 TASK

CACAO PROJECT AT THE 2009 TASK CACAO PROJECT AT THE TEL@CLEF 2009 TASK Alessio Bosca, Luca Dini Celi s.r.l. - 10131 Torino - C. Moncalieri, 21 alessio.bosca, dini@celi.it Abstract This paper presents the participation of the CACAO prototype

More information

NLP - Based Expert System for Database Design and Development

NLP - Based Expert System for Database Design and Development NLP - Based Expert System for Database Design and Development U. Leelarathna 1, G. Ranasinghe 1, N. Wimalasena 1, D. Weerasinghe 1, A. Karunananda 2 Faculty of Information Technology, University of Moratuwa,

More information

Precise Medication Extraction using Agile Text Mining

Precise Medication Extraction using Agile Text Mining Precise Medication Extraction using Agile Text Mining Chaitanya Shivade *, James Cormack, David Milward * The Ohio State University, Columbus, Ohio, USA Linguamatics Ltd, Cambridge, UK shivade@cse.ohio-state.edu,

More information

Learning Latent Linguistic Structure to Optimize End Tasks. David A. Smith with Jason Naradowsky and Xiaoye Tiger Wu

Learning Latent Linguistic Structure to Optimize End Tasks. David A. Smith with Jason Naradowsky and Xiaoye Tiger Wu Learning Latent Linguistic Structure to Optimize End Tasks David A. Smith with Jason Naradowsky and Xiaoye Tiger Wu 12 October 2012 Learning Latent Linguistic Structure to Optimize End Tasks David A. Smith

More information

I Know Your Name: Named Entity Recognition and Structural Parsing

I Know Your Name: Named Entity Recognition and Structural Parsing I Know Your Name: Named Entity Recognition and Structural Parsing David Philipson and Nikil Viswanathan {pdavid2, nikil}@stanford.edu CS224N Fall 2011 Introduction In this project, we explore a Maximum

More information

Evaluation of Named Entity Recognition in Dutch online criminal complaints

Evaluation of Named Entity Recognition in Dutch online criminal complaints Evaluation of Named Entity Recognition in Dutch online criminal complaints Marijn Schraagen Floris Bex Matthieu Brinkhuis Utrecht University June 12, 2017 Internet fraud Online trade is widespread Transactions

More information

TIPSTER Text Phase II Architecture Requirements

TIPSTER Text Phase II Architecture Requirements 1.0 INTRODUCTION TIPSTER Text Phase II Architecture Requirements 1.1 Requirements Traceability Version 2.0p 3 June 1996 Architecture Commitee tipster @ tipster.org The requirements herein are derived from

More information

A Multilingual Social Media Linguistic Corpus

A Multilingual Social Media Linguistic Corpus A Multilingual Social Media Linguistic Corpus Luis Rei 1,2 Dunja Mladenić 1,2 Simon Krek 1 1 Artificial Intelligence Laboratory Jožef Stefan Institute 2 Jožef Stefan International Postgraduate School 4th

More information

Discriminative Training with Perceptron Algorithm for POS Tagging Task

Discriminative Training with Perceptron Algorithm for POS Tagging Task Discriminative Training with Perceptron Algorithm for POS Tagging Task Mahsa Yarmohammadi Center for Spoken Language Understanding Oregon Health & Science University Portland, Oregon yarmoham@ohsu.edu

More information

WebAnno: a flexible, web-based annotation tool for CLARIN

WebAnno: a flexible, web-based annotation tool for CLARIN WebAnno: a flexible, web-based annotation tool for CLARIN Richard Eckart de Castilho, Chris Biemann, Iryna Gurevych, Seid Muhie Yimam #WebAnno This work is licensed under a Attribution-NonCommercial-ShareAlike

More information

structure of the presentation Frame Semantics knowledge-representation in larger-scale structures the concept of frame

structure of the presentation Frame Semantics knowledge-representation in larger-scale structures the concept of frame structure of the presentation Frame Semantics semantic characterisation of situations or states of affairs 1. introduction (partially taken from a presentation of Markus Egg): i. what is a frame supposed

More information

Taming Text. How to Find, Organize, and Manipulate It MANNING GRANT S. INGERSOLL THOMAS S. MORTON ANDREW L. KARRIS. Shelter Island

Taming Text. How to Find, Organize, and Manipulate It MANNING GRANT S. INGERSOLL THOMAS S. MORTON ANDREW L. KARRIS. Shelter Island Taming Text How to Find, Organize, and Manipulate It GRANT S. INGERSOLL THOMAS S. MORTON ANDREW L. KARRIS 11 MANNING Shelter Island contents foreword xiii preface xiv acknowledgments xvii about this book

More information

Anette Frank, Markus Becker, Berthold Crysmann, Bernd Kiefer and Ulrich Schäfer

Anette Frank, Markus Becker, Berthold Crysmann, Bernd Kiefer and Ulrich Schäfer Integrated Shallow and Deep Parsing: TopP meets HPSG Anette Frank, Markus Becker, Berthold Crysmann, Bernd Kiefer and Ulrich Schäfer DFKI GmbH School of Informatics 66123 Saarbrücken, Germany University

More information

Automatic Evaluation of Parser Robustness: Eliminating Manual Labor and Annotated Resources

Automatic Evaluation of Parser Robustness: Eliminating Manual Labor and Annotated Resources Automatic Evaluation of Parser Robustness: Eliminating Manual Labor and Annotated Resources Johnny BIGERT KTH Nada SE-10044 Stockholm johnny@nada.kth.se Jonas SJÖBERGH KTH Nada SE-10044 Stockholm jsh@nada.kth.se

More information

ANC2Go: A Web Application for Customized Corpus Creation

ANC2Go: A Web Application for Customized Corpus Creation ANC2Go: A Web Application for Customized Corpus Creation Nancy Ide, Keith Suderman, Brian Simms Department of Computer Science, Vassar College Poughkeepsie, New York 12604 USA {ide, suderman, brsimms}@cs.vassar.edu

More information

A Linguistic Approach for Semantic Web Service Discovery

A Linguistic Approach for Semantic Web Service Discovery A Linguistic Approach for Semantic Web Service Discovery Jordy Sangers 307370js jordysangers@hotmail.com Bachelor Thesis Economics and Informatics Erasmus School of Economics Erasmus University Rotterdam

More information

Prediction-Based NLP System by Boyer-Moore Algorithm for Requirements Elicitation

Prediction-Based NLP System by Boyer-Moore Algorithm for Requirements Elicitation Prediction-Based NLP System by Boyer-Moore Algorithm for Requirements Elicitation Dr A.Sumithra 1, K.Poongothai 2, Dr S.Gavaskar 3 1 Associate Professor, Dept of Computer Science & Engineering, VSB College

More information

Lab II - Product Specification Outline. CS 411W Lab II. Prototype Product Specification For CLASH. Professor Janet Brunelle Professor Hill Price

Lab II - Product Specification Outline. CS 411W Lab II. Prototype Product Specification For CLASH. Professor Janet Brunelle Professor Hill Price Lab II - Product Specification Outline CS 411W Lab II Prototype Product Specification For CLASH Professor Janet Brunelle Professor Hill Price Prepared by: Artem Fisan Date: 04/20/2015 Table of Contents

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

Corpus methods for sociolinguistics. Emily M. Bender NWAV 31 - October 10, 2002

Corpus methods for sociolinguistics. Emily M. Bender NWAV 31 - October 10, 2002 Corpus methods for sociolinguistics Emily M. Bender bender@csli.stanford.edu NWAV 31 - October 10, 2002 Overview Introduction Corpora of interest Software for accessing and analyzing corpora (demo) Basic

More information

A platform for collaborative semantic annotation

A platform for collaborative semantic annotation A platform for collaborative semantic annotation Valerio Basile and Johan Bos and Kilian Evang and Noortje Venhuizen {v.basile,johan.bos,k.evang,n.j.venhuizen}@rug.nl Center for Language and Cognition

More information

A Textual Entailment System using Web based Machine Translation System

A Textual Entailment System using Web based Machine Translation System A Textual Entailment System using Web based Machine Translation System Partha Pakray 1, Snehasis Neogi 1, Sivaji Bandyopadhyay 1, Alexander Gelbukh 2 1 Computer Science and Engineering Department, Jadavpur

More information

Question Answering Approach Using a WordNet-based Answer Type Taxonomy

Question Answering Approach Using a WordNet-based Answer Type Taxonomy Question Answering Approach Using a WordNet-based Answer Type Taxonomy Seung-Hoon Na, In-Su Kang, Sang-Yool Lee, Jong-Hyeok Lee Department of Computer Science and Engineering, Electrical and Computer Engineering

More information

English Understanding: From Annotations to AMRs

English Understanding: From Annotations to AMRs English Understanding: From Annotations to AMRs Nathan Schneider August 28, 2012 :: ISI NLP Group :: Summer Internship Project Presentation 1 Current state of the art: syntax-based MT Hierarchical/syntactic

More information