
DEEPTHOUGHT
Hybrid Deep and Shallow Methods for Knowledge-Intensive Information Extraction

Deliverable D1.4
Report Describing Integration Strategies and Experiments

The Consortium
October 2004

Report Describing Integration Strategies and Experiments (D1.4)

Project ref. no.                 -
Project acronym                  DEEPTHOUGHT
Project full title               Hybrid Deep and Shallow Methods for Knowledge-Intensive Information Extraction
Security (distribution level)    Public
Contractual date of delivery     15.10.2004
Actual date of delivery          15.10.2004
Deliverable number               D1.4
Deliverable name                 Report Describing Integration Strategies and Experiments
Type                             Report
Status & version                 Final
Number of pages                  9
WP contributing to deliverable   WP1b
WP / Task responsible            WP1b
Other contributors               -
Author(s)                        John Carroll, Alex Fang, Melanie Siegel
EC Project Officer               Evangelia Markidou
Keywords                         Hybrid NLP, Named-Entity Recognition, architecture

Abstract: The implemented strategies for hybrid NLP are described and examples are given using screenshots.

Contents

1  Integration Strategies in the Heart of Gold
2  Mobile Phone Name Recognition for English
   2.1  Input and Output Specification
   2.2  Construction of an annotated sub-corpus
   2.3  The recognition program
   2.4  A quantitative evaluation of the mobile phone name recogniser
   2.5  Error Analysis
3  References

1 Integration Strategies in the Heart of Gold

Implemented strategies for hybrid NLP in the project include:

- The analysis results of NLP tools at lower processing levels can be used by components at higher levels.
  o For example, the deep linguistic analysis module PET uses default lexicon entries for the named entities delivered by the named-entity recogniser SProUT.
  o Likewise, PET uses default lexicon entries for the part-of-speech tags delivered by the POS tagger TnT.
- Deliver the deepest result found. If a module of the required depth cannot deliver a result, deliver the next deepest result. This is the approach that the email autoresponse application mainly follows (a minimal sketch of this fallback logic is given at the end of this section).

- Deliver partial results whenever a complete analysis is not available. Partial results are taken from the deepest module that delivers results.
- Combine modules and grammars for different languages. Each language has its own configuration of valid modules and grammars.

- The different modules use a compatible output formalism, RMRS.
  o For the shallower modules, this robust semantic structure allows for underspecification of, e.g., argument structure.
- Refine the data provided by shallower modules through deep parsing. This is a strategy used by the Business Intelligence and Email Autoresponse applications: chunk processing and named-entity recognition are used to find relevant information sources, and deep processing is then applied to the found information snippets, either to verify or to filter the extracted information.
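As a concrete illustration of the fallback strategy above (deliver the deepest result found), the following minimal Python sketch shows the logic. The module names and return values are invented for illustration; the real Heart of Gold components (PET, RASP, TnT, SProUT) exchange XML-encoded RMRS structures rather than strings.

    from typing import Callable, List, Optional

    def analyse(sentence: str,
                modules: List[Callable[[str], Optional[str]]]) -> Optional[str]:
        """Try each module from deepest to shallowest; return the first
        result delivered, or None if no module can analyse the input."""
        for module in modules:
            result = module(sentence)
            if result is not None:
                return result
        return None

    # Toy stand-ins for real components; each returns None when it cannot
    # deliver an analysis at its depth.
    def deep_parser(s: str) -> Optional[str]:
        return None if len(s.split()) > 12 else "deep RMRS for: " + s

    def chunker(s: str) -> Optional[str]:
        return "chunk-level RMRS for: " + s

    def pos_tagger(s: str) -> Optional[str]:
        return "POS-tag-level RMRS for: " + s

    print(analyse("I am thinking of upgrading to the Sony Ericsson T68is",
                  [deep_parser, chunker, pos_tagger]))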

2 Mobile Phone Name Recognition for English

We describe below the construction and evaluation of a module for named-entity recognition of mobile phone names. The module was integrated into the RASP English shallow analysis system, which in turn forms part of the Heart of Gold. On a manually annotated test set, the module achieved a recognition F-score of 81.5%.

2.1 Input and Output Specification

The input to the mobile phone name recognition module is a sequence of sentences in English that have already been marked up in XML style for word boundaries, with part-of-speech tags automatically assigned by RASP (Briscoe and Carroll 2002). Since this happens before the morphological analyser in the RASP pipeline, the tokens have not been lemmatized. For example, given the sentence "I am thinking of upgrading to the Sony Ericsson T68is from a Nokia 8260", the input to the module is:

    ^ ^
    <w s='2' e='2'>i</w> PPIS1
    <w s='4' e='5'>am</w> VBM
    <w s='7' e='14'>thinking</w> VVG
    <w s='16' e='17'>of</w> IO
    <w s='19' e='27'>upgrading</w> NN1
    <w s='29' e='30'>to</w> II
    <w s='32' e='34'>the</w> AT
    <w s='36' e='39'>sony</w> NP1
    <w s='41' e='48'>ericsson</w> NP1
    <w s='50' e='54'>t68is</w> NN1
    <w s='56' e='59'>from</w> II
    <w s='61' e='61'>a</w> AT1
    <w s='63' e='67'>nokia</w> NP1
    <w s='69' e='72'>8260</w> MC
    ^ ^

The task of the module is to mark up the mobile phone named entities in the input, namely Sony Ericsson T68is and Nokia 8260 in this example:

    ^ ^
    <w s='2' e='2'>i</w> PPIS1
    <w s='4' e='5'>am</w> VBM
    <w s='7' e='14'>thinking</w> VVG
    <w s='16' e='17'>of</w> IO
    <w s='19' e='27'>upgrading</w> NN1
    <w s='29' e='30'>to</w> II
    <w s='32' e='34'>the</w> AT
    <w netype='phone'>
      <w s='36' e='39'>sony</w>
      <w s='41' e='48'>ericsson</w>
      <w s='50' e='54'>t68is</w>
    </w> NP
    <w s='56' e='59'>from</w> II
    <w s='61' e='61'>a</w> AT1
    <w netype='phone'>
      <w s='63' e='67'>nokia</w>
      <w s='69' e='72'>8260</w>
    </w> NP
    ^ ^

Here Sony Ericsson T68is and Nokia 8260 are marked up as named entities of type mobile phone (netype='phone'), and each is treated as a single unit tagged NP, i.e. a proper name. The analysis based on this output from the module is then taken further down the RASP pipeline and yields an RMRS representation.

[Figure: RMRS representation of the analysed example sentence]
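Producing the marked-up output above from recognised entity spans is mechanical; the following minimal Python sketch shows one way to do it. The token and span representations, and all names, are our own illustration rather than the module's actual C implementation.

    def mark_up(tokens, phone_spans):
        """Emit RASP-style XML lines, wrapping each phone-name span
        (given as (first, last) token indices) in a <w netype='phone'>
        group that is tagged NP as a whole."""
        starts = {first: last for first, last in phone_spans}
        lines, i = [], 0
        while i < len(tokens):
            s, e, form, tag = tokens[i]
            if i in starts:
                last = starts[i]
                lines.append("<w netype='phone'>")
                for s2, e2, form2, _tag2 in tokens[i:last + 1]:
                    lines.append("  <w s='%d' e='%d'>%s</w>" % (s2, e2, form2))
                lines.append("</w> NP")
                i = last + 1
            else:
                lines.append("<w s='%d' e='%d'>%s</w> %s" % (s, e, form, tag))
                i += 1
        return lines

    tokens = [(32, 34, "the", "AT"),
              (36, 39, "sony", "NP1"),
              (41, 48, "ericsson", "NP1"),
              (50, 54, "t68is", "NN1"),
              (56, 59, "from", "II")]
    print("\n".join(mark_up(tokens, [(1, 3)])))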

2.2 Construction of an annotated sub-corpus

For the work described in Workpackage 2B, a 4,000,000-word corpus of Internet discussions on mobile phones was created for the domain-specific extraction of a verb subcategorisation lexicon (Carroll and Fang 2004). From this corpus, we randomly selected two sets of 200 texts each. Each text was then manually annotated such that each instance of a mobile phone name, a model number, or any combination of the two was marked up as an entity (between <mobile> and </mobile> tags). Here is an example:

    I have for sale the following ORIGINAL <mobile> Nokia </mobile>
    accessories that will fit any of the <mobile> Nokia 6100 </mobile> /
    <mobile> 5100 </mobile> series phones, including but not limited to
    <mobile> 6160 </mobile>, <mobile> 6190 </mobile>, <mobile> 6188 </mobile>,
    <mobile> 6185 </mobile>, <mobile> 6162 </mobile>, <mobile> 6161 </mobile>,
    <mobile> 6185i </mobile>, <mobile> 5160 </mobile>, <mobile> 5190 </mobile>, etc.

The two sets are summarised in Table 1:

              Texts   Sentences    Words   Entities
    Set 1       200        3314    60804        447
    Set 2       200        2624    46117        454
    Total       400        5938   106921        901

    Table 1: A summary of the annotated corpus

2.3 The recognition program

The automatic recogniser was implemented in C. The algorithm was designed around the observation that the distribution of mobile phone names in our corpus is relatively sparse: there is insufficient data to train a purely statistical recogniser (e.g. a maximum entropy model), though it may be possible to train a combined symbolic/statistical model (incorporating, for example, information on manufacturer names).

A set of mobile phone manufacturer names, such as Nokia and Ericsson, was manually drawn up. The remainder of the mobile phone corpus that had not been annotated (ca 2,800,000 tokens) was then used to construct a list of all the alphanumeric strings that contain at least one digit and that immediately follow one of these names. This process resulted in two entity sets:

- a list of mobile phone names
- a list of model numbers with their associated mobile phone names

The automatic recogniser marks the following as an entity (a sketch of the procedure follows this list):

- every occurrence of the mobile phone names
- every occurrence of the model numbers, provided that
  o they are longer than 3 characters
  o they occurred more than once in the training corpus
  o they occurred fewer than 2,000 times in the training corpus (numbers occurring more than 2,000 times are interpreted as genuine "free" cardinals that are unlikely to be used in reference to a mobile phone)
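Below is a minimal Python sketch of this heuristic, under the assumption that the "occurred ... times" conditions refer to overall token frequency in the training corpus. The manufacturer list is truncated to the two names mentioned above, and all function names are our own; the actual recogniser was written in C.

    import re
    from collections import Counter

    # Example entries; the real list was drawn up manually.
    MANUFACTURERS = {"nokia", "ericsson"}

    def build_entity_lists(tokens):
        """Collect candidate model numbers: alphanumeric strings containing
        at least one digit that immediately follow a manufacturer name.
        Also count token frequencies for the filtering conditions below."""
        candidates = set()
        freq = Counter(t.lower() for t in tokens)
        for prev, tok in zip(tokens, tokens[1:]):
            if prev.lower() in MANUFACTURERS and tok.isalnum() and re.search(r"\d", tok):
                candidates.add(tok.lower())
        return candidates, freq

    def accept_model_number(tok, candidates, freq):
        """The three conditions: length > 3, frequency > 1, frequency < 2000
        (very frequent numbers are taken to be 'free' cardinals)."""
        t = tok.lower()
        return t in candidates and len(t) > 3 and 1 < freq[t] < 2000

    # Toy usage; in the project the lists were built from the ~2.8M-token
    # unannotated part of the corpus.
    tokens = ("the nokia 8260 is cheaper than the ericsson t68is and "
              "the nokia 8260 beats the ericsson t68is").split()
    cands, freq = build_entity_lists(tokens)
    print([t for t in tokens
           if t.lower() in MANUFACTURERS or accept_model_number(t, cands, freq)])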

2.4 A quantitative evaluation of the mobile phone name recogniser

For the quantitative evaluation of the mobile phone name recogniser's performance, the first annotated set was used for development and the second was held out for testing. Both sets were sub-divided into four subsets containing the same number of word tokens, in order to expose any variation in performance. The initial run of the recogniser on the development set produced the following results:

                    1       2       3       4   Total
    Precision    63.0    69.7    87.9    73.8    75.8
    Recall       96.7    84.8    79.7    84.3    83.4
    F-Score      76.3    76.5    83.6    78.7    79.4

    Table 2: Performance before tuning on the development set

The F-score on the development set was just under 80%. There is variation across the four subsets, with subset 3 showing the best F-score of 83.6%. The output was manually inspected and changes were made to the lists of mobile phone names and model numbers. Subsequent performance on the development set shows an F-score of 82.1%, an increase of nearly 3 points over the previous 79.4%:

                    1       2       3       4   Total
    Precision    61.7    74.9    89.2    74.1    78.3
    Recall       96.7    90.3    81.3    85.7    86.4
    F-Score      75.3    81.9    85.1    79.5    82.1

    Table 3: Performance after tuning on the development set

When run on the test set, the recogniser achieved an overall F-score of 81.5%, with a precision of 81.0% and a recall of 81.9%:

                    1       2       3       4   Total
    Precision    85.9    69.7    83.5    92.9    81.0
    Recall       97.0    77.1    71.7    81.2    81.9
    F-Score      91.1    73.2    77.1    86.7    81.5

    Table 4: Performance on the test set

As Table 4 shows, the best subset F-score was 91.1% and the worst 73.2%. This considerable variation suggests that the system's performance depends on the type of input it receives.
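For reference, the scores above follow the standard entity-level definitions of precision, recall and balanced F-score; a short sketch (the true/false positive counting against the manual annotation is our own illustration):

    def prf(tp, fp, fn):
        """Entity-level precision, recall and balanced F-score (F1)."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f = 2 * precision * recall / (precision + recall)
        return precision, recall, f

    # Sanity check against the reported totals: precision 81.0 and recall
    # 81.9 give F = 2 * 81.0 * 81.9 / (81.0 + 81.9) = 81.45, consistent
    # with the reported 81.5% given that the inputs are themselves rounded.
    p, r = 0.810, 0.819
    print(2 * p * r / (p + r))   # -> 0.8145...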

2.5 Error Analysis

There are two major sources of errors. First, there is frequent ambiguity between phone names and company names, as in the following example:

    I was wondering if anyone has any information on how the Ericsson
    Bluetooth kits calculates the BER packets when the BER test is run.

where Ericsson should be analysed as referring to the company rather than the phone. Arguably, this is a genuinely ambiguous case. The second major source is ambiguity between plain numbers and model numbers:

    There are 2 connectors on the cable, 1 RS 232 and 1 cigarette lighter.

Since 232 has been observed co-occurring with mobile phone names, the module takes it to refer to a mobile phone product in this context and therefore erroneously marks it as a phone name.

3 References

Briscoe, E. and J. Carroll (2002). Robust accurate statistical annotation of general text. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas, Gran Canaria, 1499-1504.

Carroll, J. and A.C. Fang (2004). The Automatic Acquisition of Verb Subcategorisations and their Impact on an HPSG Parser. In Proceedings of the 1st International Joint Conference on Natural Language Processing, 22-24 March 2004, Hainan, China.

Uszkoreit, H., U. Callmeier, A. Eisele, U. Schäfer, M. Siegel and J. Uszkoreit (2004). Hybrid Robust Deep and Shallow Semantic Processing for Creativity Support in Document Production. In Proceedings of KONVENS 2004, Vienna, Austria.

Callmeier, U., A. Eisele, U. Schäfer and M. Siegel (2004). The DeepThought Core Architecture Framework. In Proceedings of LREC 2004, Lisbon, Portugal.