
The Muc7 T Corpus

Katrin Tomanek and Udo Hahn
Jena University Language & Information Engineering (JULIE) Lab
Friedrich-Schiller-Universität Jena, Germany
{katrin.tomanek udo.hahn}@uni-jena.de

1 Introduction

This document gives a short description of the creation of the Muc7 T corpus, its package structure, and the underlying data format. Finally, two use cases of Muc7 T are briefly described.

2 Creation of Muc7 T

Muc7 T is an extension of the Muc7 corpus (Linguistic Data Consortium, 2001) in which we couple common named entity annotation metadata with a time stamp indicating the time measured for the linguistic decision-making process. (These time stamps should not be confounded with the annotation of temporal expressions, e.g., Timex in Muc7.) To this end, we ran a re-annotation initiative which targeted the named entity annotations (Enamex) of the English part of the Muc7 corpus, viz. persons, locations, and organizations. The annotation was done by two advanced students of linguistics with good English language skills. For consistency reasons, the original guidelines of the Muc7 named entity task were used.

2.1 Data

The original Muc7 corpus consists of three distinct document sets for the named entity task (train, test, and dry run). We used the test set to train the annotators and to develop the annotation design. The Muc7 T corpus consists of the articles from the train set, which comprises 100 articles reporting on airplane crashes. We had to split lengthy documents (27 out of the 100) into halves so that they fit on the screen of the annotation GUI without the need for scrolling. (We aimed at avoiding scrolling in order to keep the mechanical overhead of the actual annotation procedure as low as possible, so that the annotation times basically reflect the cognitive processes only.) Still, we had to exclude the following

two documents due to extreme over-length, which would have required overly many splits: nyt960718.0792 and nyt960721.0140.

Figure 1: Screenshot of the annotation GUI showing an annotation example where the complex noun phrase "GTE Airfone services" is highlighted for annotation.

2.2 Annotation Principles

In the Muc7 T corpus, annotation time measurements are recorded for two syntactically different annotation units: (a) complete sentences and (b) complex noun phrases (CNPs), which are top-level noun phrases in the constituency structure of the respective sentence. The annotation task was defined so as to assign an entity type label to each token of an annotation unit. Please refer to Tomanek and Hahn (2010) for a discussion of why CNPs were used and how they were derived automatically.

For precise time measurements, single annotation examples were shown to the annotators, one at a time. An annotation example consists of the chosen Muc7 document with one annotation unit (sentence or CNP) selected and highlighted (yet without annotation). Only the highlighted part of the document could be annotated, and the annotators were asked to read only as much of the visible context surrounding the annotation unit as necessary to make a proper annotation decision. Figure 1 shows a screenshot of the annotation GUI. Annotation was performed in blocks of 500 CNP-level or 100 sentence-level annotation examples. In the Muc7 T corpus, annotation time metadata of both annotators is available on both the CNP level and the sentence level.
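To make the token-wise labeling scheme concrete, the following minimal Python sketch represents one CNP annotation unit as token-label pairs. The tokens and labels are taken from the sample annotation example shown in Figure 4; the variable names are illustrative only.

```python
# Token-wise entity type labels for one CNP annotation unit,
# as assigned in the Muc7 T sample shown in Figure 4.
tokens = "the crash of Pan Am 103 over Lockerbie".split()
labels = "O O O ORGANIZATION ORGANIZATION O O LOCATION".split()

# Every token of the annotation unit receives exactly one label.
assert len(tokens) == len(labels)

unit = list(zip(tokens, labels))

# Collect the tokens that belong to a named entity of a given type.
orgs = [tok for tok, lab in unit if lab == "ORGANIZATION"]
print(orgs)  # ['Pan', 'Am']
```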

Further details on the creation and annotation principles of Muc7 T can be found in Tomanek and Hahn (2010). This paper also contains an analysis of the inter-annotator agreement (Cohen's Kappa of κ = 0.95 for both annotators) as well as other corpus statistics.

3 Package Structure

Figure 2 shows the directory structure of the Muc7 T package. Some very short descriptions of and remarks on each subdirectory:

data: This directory contains the actual Muc7 T data. You will find the data for annotators A and B, each separately. For both annotators, there is a version of Muc7 T with CNP-level and with sentence-level annotations. Section 4 discusses the XML format used in more detail.

docs: Contains this documentation as well as publications describing applications of Muc7 T. There is also a small JavaDoc for the Java tools (see below).

dtd: You will find the Document Type Definition (DTD) for the data here.

tools: There is a small Java API which allows reading the Muc7 T XML data so that each annotation example is represented by a Java object. Besides the source code, you will also find a jar package. The code has been tested with Java 1.5 and Java 1.6.

4 Data Format

The Muc7 T corpus is stored in XML format. See Figure 3 for the respective DTD. There is an element anno_example for each annotation example. It has the original Muc7 document as text content. The Muc7 document was tokenized using the Stanford Tokenizer (part of the Stanford Parser package, which can be obtained from http://nlp.stanford.edu/software/lex-parser.shtml), with white spaces marking token boundaries. The following attributes are used for the element anno_example:

anno_time: The time it took to annotate the annotation unit of this annotation example (time in milliseconds).

data
  annotatora
    CNPs
    sents
  annotatorb
    CNPs
    sents
docs
  publications
  JavaDoc
dtd
tools
  src
    de
      julielab
        muc7timed

Figure 2: Directory structure of the Muc7 T package.

anno_unit_tokens: All tokens of the annotation unit.

anno_unit_offset: Offsets for the tokens of the annotation unit relative to all tokens in the annotation example.

anno_unit_labels: Labels for the tokens of the annotation unit (these labels are taken from Muc7).

doc_id: ID of the document of the annotation example.

sent_id: ID of the sentence of the annotation example.

anno_unit_id: ID of the annotation unit of the annotation example.

All three IDs (doc_id, sent_id, and anno_unit_id) jointly yield a unique identifier of the annotation example. Moreover, they allow the annotation examples to be regrouped or reordered, e.g., by document or sentence. They can also be used as links between the CNP-level and the sentence-level version of Muc7 T.

muc7_org_filename: The name of the original Muc7 document from which this annotation example is taken.

Figure 4 shows such an annotation example in XML format. It stems from the original Muc7 file nyt960721.0261. It took about 4.5 seconds to annotate. The annotation unit (here, a CNP) consists of the 4th to 11th token (counting from 0) of the annotation example text; these tokens are "the crash of Pan Am 103 over Lockerbie", where "Pan Am" is marked as an organization and "Lockerbie" as a location. This annotation unit

<!ELEMENT anno_examples (anno_example)+>
<!ELEMENT anno_example (#PCDATA)>
<!ATTLIST anno_example
  anno_time         CDATA #REQUIRED
  anno_unit_tokens  CDATA #REQUIRED
  anno_unit_offset  CDATA #REQUIRED
  anno_unit_labels  CDATA #REQUIRED
  doc_id            CDATA #REQUIRED
  sent_id           CDATA #REQUIRED
  anno_unit_id      CDATA #REQUIRED
  muc7_org_filename CDATA #REQUIRED
>

Figure 3: Document Type Definition (DTD) of the XML format.

was taken from the 68th Muc7 document, is in the 1st sentence of this document, and therein is the 1st CNP. The next CNP-level annotation example would be for the annotation unit "Scotland", having anno_unit_id="2" (doc_id and sent_id would stay the same).

5 Use Cases of Muc7 T

We created Muc7 T with two main purposes in mind, both in the context of resource- and cost-conscious annotation strategies. On the one hand, it can be used for evaluations of selective sampling strategies such as Active Learning (Cohn et al., 1996). Instead of relying on empirically questionable assumptions about the necessary annotation effort (e.g., the assumption that annotation costs are uniform over the number of linguistic units, typically tokens, to be annotated), Muc7 T now allows repeatable simulations of selective sampling strategies in which the annotation effort is expressed by the actual time needed to annotate a selected item. This use case is described in more detail in Tomanek and Hahn (2010) and Tomanek (2010).

Another use case for Muc7 T is the creation of predictive models of annotation costs. Such models are needed when selective sampling strategies, such as Active Learning, should not only select on the basis of the estimated informativeness or utility of an example (to be maximized), but also take into account the estimated time this example would require for annotation (to be minimized). As annotation costs are not known prior to annotation, they have to be estimated. In Tomanek et al.
(2010), we describe an empirical study in which the annotators' reading behavior was observed with an eye-tracking device while a corpus was annotated. With the insights into the factors influencing annotation time which we gathered through this study, we were able to induce such a much-needed predictive model of annotation costs.
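For the first use case, the time cost of a selected annotation sample can be read directly from the anno_time attribute. The following Python sketch (the embedded document is a shortened, illustrative stand-in following the format of Figures 3 and 4; real files carry all eight attributes) sums the annotation times of a chosen set of annotation examples, identified by their ID triples:

```python
import xml.etree.ElementTree as ET

def sampling_cost_ms(xml_string, selected_ids):
    """Total annotation time (ms) of the annotation examples whose
    (doc_id, sent_id, anno_unit_id) triple is in selected_ids."""
    root = ET.fromstring(xml_string)
    cost = 0
    for ex in root.iter("anno_example"):
        key = (ex.get("doc_id"), ex.get("sent_id"), ex.get("anno_unit_id"))
        if key in selected_ids:
            cost += int(ex.get("anno_time"))
    return cost

# Two illustrative examples in the Muc7 T format (attributes abridged
# to those needed here; values modeled on the sample in Figure 4).
sample = """<anno_examples>
  <anno_example anno_time="4448" doc_id="68" sent_id="2"
      anno_unit_id="1">...</anno_example>
  <anno_example anno_time="1200" doc_id="68" sent_id="2"
      anno_unit_id="2">...</anno_example>
</anno_examples>"""

print(sampling_cost_ms(sample, {("68", "2", "1"), ("68", "2", "2")}))  # 5648
```

A selective sampling simulation would call such a function on the set of examples picked in each sampling round, yielding a time-based rather than token-based cost curve.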

<anno_example
  anno_time="4448"
  anno_unit_labels="O O O ORGANIZATION ORGANIZATION O O LOCATION"
  anno_unit_offset="4,5,6,7,8,9,10,11"
  anno_unit_tokens="the crash of Pan Am 103 over Lockerbie"
  doc_id="68"
  sent_id="2"
  anno_unit_id="1"
  muc7_org_filename="nyt960721.0261"
>
MORICHES, N.Y. After the crash of Pan Am 103 over Lockerbie, Scotland, in December 1988, it took investigators seven days to determine that the cause was a bomb. But after a Boeing 737 crashed on approach to Pittsburgh in September 1994 which was [...]
</anno_example>

Figure 4: A sample annotation example in XML format.

Acknowledgements

The construction of the Muc7 T corpus was partially funded by the BOOTStrep Project (FP6-028099) within the 6th Framework Programme of the European Commission.

References

[Cohn et al. 1996] Cohn, David; Ghahramani, Zoubin; Jordan, Michael: Active Learning with Statistical Models. In: Journal of Artificial Intelligence Research 4 (1996), pp. 129-145.

[Linguistic Data Consortium 2001] Linguistic Data Consortium: Message Understanding Conference (MUC) 7. LDC2001T02. FTP FILE. 2001. Philadelphia: Linguistic Data Consortium.

[Tomanek 2010] Tomanek, Katrin: Resource-Aware Annotation through Active Learning. Technical University of Dortmund, Dissertation, 2010.

[Tomanek and Hahn 2010] Tomanek, Katrin; Hahn, Udo: Annotation Time Stamps: Temporal Metadata from the Linguistic Annotation Process. In: LREC '10: Proceedings of the 7th International Conference on Language Resources and Evaluation, European Language Resources Association, 2010.

[Tomanek et al. 2010] Tomanek, Katrin; Hahn, Udo; Lohmann, Steffen; Ziegler, Jürgen: A Cognitive Cost Model of Annotations Based on Eye-Tracking Data. In: ACL '10: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2010.